Sabatier

We investigate the asymptotic behavior of one version of the so-called
two-armed bandit algorithm. It is an example of a stochastic approximation
procedure whose associated ODE has both a repulsive and an attractive
equilibrium, at which the procedure is noiseless. We show that if the gain
parameter is constant or goes to 0 not too fast, the algorithm does fall into
the noiseless repulsive equilibrium with positive probability, whereas it
always converges to its natural attractive target when the gain parameter goes
to zero at some appropriate rates depending on the parameters of the model. We
also elucidate the behavior of the constant step algorithm when the step goes
to 0. Finally, we highlight the connection between the algorithm and the
Polya urn. An application to asset allocation is briefly described.

Introduction. The aim of this paper is to investigate in depth the asymptotic
behavior of the so-called two-armed bandit algorithm. This stochastic
approximation procedure is widely known in the fields of mathematical
psychology and learning automata (see [13] and [15]). Our own motivations are
both theoretical and practical, as will be seen below. Let us first introduce
the algorithm itself in a financial context, namely as an adaptive optimal
asset allocation model.

Imagine a fund managed by only two traders, say A and B: every day each of
them is in charge of a percentage of the fund, which may vary from day to day.
The few wealthy investors (the shareholders) who created the fund ideally wish
to allocate the whole fund to the most efficient trader, but of course they do
not know who he is. They also want to take advantage of the performances of
the best trader as soon as possible. This means they need to devise a periodic
re-allocation procedure of the fund to the traders based on their (daily or
monthly) performances. On the other hand, this procedure should not be too
"upsetting" to the traders in order to preserve their motivation and
self-confidence: one way is to favor reward over punishment. Taking all these
specifications into account suggests proceeding as follows: let $X_n$ be the
fraction of the fund managed by trader A during day $n$, the fraction
$1 - X_n$ being managed by trader B. Every day, one trader is chosen at random
and his performances of the day are evaluated. Assume it is A for a moment. If
they are considered outstanding, trader A is rewarded by an extra allocation
for day $n+1$ of $\gamma_{n+1}$ times the fraction managed by trader B during
day $n$ (whatever the performances of trader B are, since he was not checked).
So trader A will manage a fraction $X_n + \gamma_{n+1}(1 - X_n)$ of the fund
during day $n+1$. If his performances are not high enough to deserve a reward,
nothing happens. The same procedure is applied to trader B when he is checked:
if B has outstanding performances, he is awarded an extra allocation of
$\gamma_{n+1}$ times the share managed by A during day $n$ so that, during day
$n+1$, the share managed by A will be reduced to
$X_{n+1} = X_n - \gamma_{n+1}X_n$ (whatever his performances on day $n$ were).
One models the daily performance evaluations of A and B by two sequences of
events $(A_n)_{n\ge 1}$ and $(B_n)_{n\ge 1}$, respectively:
$A_n = \{\text{A's performances on day } n \text{ are outstanding}\}$ and
$B_n = \{\text{B's performances on day } n \text{ are outstanding}\}$.

A natural policy for the investors of the fund is to reduce the risks induced
by this strategy by controlling the largest possible part (on average) of the
whole fund. So choosing the checked trader by tossing a fair coin is not
appropriate. What seems more efficient is to use for the daily toss a biased
(virtual) coin so that the probability for trader A or B to be checked at the
end of day $n$ is equal to the share of the fund they managed that day, namely
$X_n$ and $1 - X_n$, respectively. This virtual coin can be tossed by
generating on a computer some i.i.d. random numbers $U_n$, $n \ge 1$,
uniformly distributed on $[0,1]$, and by setting

$\{\text{A is checked at the end of day } n\} = \{U_{n+1} \le X_n\},$

$\{\text{B is checked at the end of day } n\} = \{U_{n+1} > X_n\}.$

All this leads to the following dynamics for $X_n$: for every $n \ge 0$,

(1) $X_{n+1} = X_n + \gamma_{n+1}\bigl((1 - X_n)\,1_{\{U_{n+1} \le X_n\} \cap A_{n+1}} - X_n\,1_{\{U_{n+1} > X_n\} \cap B_{n+1}}\bigr), \qquad X_0 = x \in [0,1],$

where $(\gamma_n)_{n\ge 1}$ is the sequence of gain parameters (or steps)
satisfying

(2) $\forall n \in \mathbb{N}^*, \quad \gamma_n \in (0,1) \quad\text{and}\quad \Gamma_n := \gamma_1 + \cdots + \gamma_n \to +\infty \text{ as } n \to +\infty.$

[Note this includes the constant step setting $\gamma_n = \gamma \in (0,1)$.]
The fact that $\gamma_n$ lies in $(0,1)$ is induced by the modelling (it is a
percentage). On the other hand, the fact that $\Gamma_n$ goes to infinity is a
necessary condition to "forget" the starting value: if
$\lim_n \Gamma_n < +\infty$, $X_n$ would still converge a.s. toward a random
variable $X_\infty$, but one could not show that $X_\infty$ takes its values
in $\{0,1\}$.
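
The dynamics (1) are easy to simulate. The following is a minimal sketch, not the authors' code: the function name `simulate_bandit`, its signature and the choice of passing the steps as a function `gamma` are illustrative assumptions. It draws one uniform number to decide which trader is checked and a second one to decide whether the checked trader is rewarded, which is enough since the reward events of A and B never interact. The strict comparison `u < x` is used instead of $U_{n+1} \le X_n$ so that 0 stays absorbing in floating point; this changes nothing in distribution.

```python
import random


def simulate_bandit(x0, p_a, p_b, gamma, n_steps, rng):
    """Simulate recursion (1) and return the path [X_0, ..., X_{n_steps}].

    gamma is a function n -> gamma_n with values in (0, 1) (assumed interface).
    rng is a random.Random instance supplying the uniform draws.
    """
    x = x0
    path = [x]
    for n in range(n_steps):
        g = gamma(n + 1)          # gain gamma_{n+1} applied on day n + 1
        u = rng.random()          # U_{n+1}: selects the checked trader
        v = rng.random()          # decides whether the checked trader is rewarded
        if u < x:                 # A is checked, with probability X_n
            if v < p_a:           # A's performances are outstanding
                x = x + g * (1.0 - x)
        else:                     # B is checked, with probability 1 - X_n
            if v < p_b:           # B's performances are outstanding
                x = x - g * x
        path.append(x)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    path = simulate_bandit(0.5, 0.7, 0.3, lambda n: 1.0 / (n + 1), 1000, rng)
    print("share of trader A after 1000 days:", path[-1])
```

A single path only shows where one run ends up; whether the share of A can drift to 0 with positive probability depends on the step sequence, which is the question studied in the sequel.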

This recursive random procedure was first introduced by Norman in
mathematical psychology (see [13]) and then, independently, by Shapiro and
Narendra in the engineering literature as a linear learning automaton
(see [15]). In this field it is known as the Linear Reward-Inaction
($L_{R-I}$) scheme (see the survey [10] and the book [11] by Narendra and
Thathachar about learning automata theory). In both cases, only the constant
step setting is considered. The application to optimal adaptive asset
allocation in a financial context has been developed in [12].

The algorithm (1) is often mentioned in the literature about stochastic
approximation and recursive stochastic algorithms, this time mainly in its
decreasing step version (see [5]), as the two-armed bandit. In fact, from a
mathematical point of view, it is one of the simplest examples of a
stochastic approximation algorithm having a "noiseless trap." We will come
back to this property later on; it was another motivation for investigating
this algorithm.

The sequence $(U_n)_{n\ge 1}$ and the events $A_n$, $B_n$, $n \ge 1$, are
defined on a probability space. We will make some further assumptions on the
events $A_n$ and $B_n$, namely that the sequence

$(1_{A_n}, 1_{B_n})_{n\ge 1}$ is i.i.d.

This assumption corresponds to a "stationary" situation: the traders' daily
performances are supposed to be independent and "statistically invariant,"
that is, identically distributed; so one sets

$\mathbb{P}(A_1) = p_A$ and $\mathbb{P}(B_1) = p_B$.

Of course, the owners of the fund do not know whether $p_A > p_B$ or
$p_A < p_B$. Finally, one assumes that the sequences

$(U_n)_{n\ge 1}$ and $(1_{A_n}, 1_{B_n})_{n\ge 1}$ are independent,

that is, the daily tosses are in no way influenced by the respective past (and
future) performances of A and B except for the shares respectively managed
that day.

To elucidate the a.s. asymptotic behavior of this allocation procedure,
one could call upon classical stochastic approximation methods like the
so-called ordinary differential equation (ODE) method. It consists in
comparing the asymptotic behavior of the algorithm $(X_n)_{n\ge 1}$ with that
of the related ODE $\dot{x} = \pi h(x)$, where $\pi := p_A - p_B$ and
$\pi h(x) := \frac{1}{\gamma_{n+1}}\mathbb{E}(X_{n+1} - X_n \mid X_n = x) = \pi x(1 - x)$
is the mean function of the algorithm (see Section 2). One readily checks that
this ODE admits two equilibria, 0 and 1, and that, when $p_A > p_B$, its flow
$\Phi(x,t) = \frac{x}{x + (1-x)e^{-\pi t}}$ converges uniformly on compact
sets of $(0,1]$ toward 1 as $t \to \infty$: the equilibrium 1 is stable with
attraction interval $(0,1]$; on the other hand, 0 is repulsive (0 is then
called a trap for the algorithm). Thus, the celebrated "conditional
convergence" theorem due to Kushner and Clark in [8] says that, under
technical assumptions fulfilled here, almost every path of the algorithm that
visits infinitely many times a compact subset of the attracting interval of a
stable equilibrium will converge toward it. Applying that to a path of the
two-armed bandit algorithm shows that if it does not converge to 0, then it
necessarily visits infinitely often the compact interval $[\varepsilon, 1]$
for some $\varepsilon > 0$ and, hence, converges toward 1.
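
As an illustration of the ODE picture, the closed-form flow of $\dot{x} = \pi x(1-x)$ can be compared with a crude numerical integration. This is only a sketch: the function names `flow` and `euler` and the number of Euler steps are illustrative choices, and the closed form used is the standard logistic solution $x/(x + (1-x)e^{-\pi t})$.

```python
import math


def flow(x, t, pi):
    """Closed-form solution at time t of dx/dt = pi * x * (1 - x) started at x."""
    return x / (x + (1.0 - x) * math.exp(-pi * t))


def euler(x, t, pi, n_steps=100_000):
    """Explicit Euler approximation of the same ODE, for comparison only."""
    dt = t / n_steps
    for _ in range(n_steps):
        x += dt * pi * x * (1.0 - x)
    return x


if __name__ == "__main__":
    pi = 0.3  # plays the role of p_A - p_B > 0
    for x0 in (0.0, 0.01, 0.5, 1.0):
        print(x0, flow(x0, 50.0, pi), euler(x0, 50.0, pi))
```

Both equilibria are visible: a trajectory started exactly at 0 stays there, while any trajectory started in $(0,1]$ is driven toward 1 when $\pi > 0$.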

In some way it is not really surprising that this approach fails to decide
whether the algorithm may converge to 0, since stability is a second-order
property, whereas the ODE method is based on a first-order approximation.
Recent sophisticated first-order approaches like [2] cannot be more efficient
for the same reason.

There is a wide literature in stochastic approximation about traps and
how not to fall into them (see [3, 6, 9, 14, 16, 17]). These works all rely on
the fact that, if the noise is exciting enough at a repulsive equilibrium
$x^*$, then a.s. the algorithm will not converge to it. By "exciting enough"
one means that a conditional variance term at $x^*$ is positive. But the main
feature of the two-armed bandit algorithm is that its two equilibria (0 and 1)
lie at the boundary of its state space $[0,1]$, so the above conditional
variance term is necessarily identically 0 at the repulsive equilibrium
$x^* = 0$ (and at $x^* = 1$ as well). So, the behavior of the two-armed bandit
algorithm cannot be settled using these approaches.

As far as we know, from a mathematical point of view, the asymptotic
behavior of the algorithm has not been elucidated in the literature. The
present paper derives from results obtained independently by the third author
in [17] and the other two authors.

Heuristics, probably suggested by the behavior of the mean algorithm,
seem to suggest that the procedure described above works well in practice.

It is interesting, for both theoretical and practical reasons, to analyze
the behavior of the two-armed bandit algorithm, that is:

• Is it possible to choose the gain parameter sequence so that the
algorithm a.s. never fails?

• Conversely, does the algorithm "fall in its noiseless trap 0" for some
sequences of gain parameters?

This leads us to introduce the following terminology when
$0 \le p_B < p_A \le 1$. (Inverting the roles played by A and B solves the
case $0 \le p_A < p_B \le 1$.) The two-armed bandit algorithm is:

• fallible when starting from $x \in (0,1)$ if $\mathbb{P}_x(X_n \to 0) > 0$,
• a.s. infallible if $\mathbb{P}_x(X_n \to 0) = 0$ for every $x \in (0,1)$.

Although we are not directly interested in the critical case $p_A = p_B$, we
will investigate it in depth since it is the key to solving the general case
thanks to a comparison result.

The paper is organized as follows. Section 1 states the main theoretical
result of the paper, namely Theorem 1, concerning the convergence and the
fallibility of the algorithm. Two corollaries show its consequences for the
usual parametrized families of steps, for which some necessary and sufficient
conditions of infallibility are derived.

Section 2 is devoted to some elementary, although important, facts on
which the proof of Theorem 1 relies: Proposition 2 on the one hand and the
comparison result stated in Proposition 3 on the other. Section 3 is
mainly devoted to the proof of items (b) and (c) of Theorem 1 [item (a)
is elementary]: Section 3.1 deals with item (b) and Section 3.3 with item (c).
Section 3.2 has a particular status: it is a kind of bridge between Sections
3.1 and 3.3, in which we focus on the special case where the step $\gamma_n$
is constant, which is the historical setting considered by those who devised
the procedure. It is shown in Theorem 2 that the (positive) probability of
failure for the algorithm with constant step $\gamma$ goes to 0 as $\gamma$
goes to zero. Some bounds are displayed, the optimality of which is not known
to us. Section 4 makes a connection between regular Polya urns and the
two-armed bandit algorithm: we show that the two-armed bandit algorithm can be
seen as a generalized Polya urn. Thus, we partially retrieve the infallibility
results of Theorem 1 using standard methods of proof for Polya urns like the
"moment method" and the log-method. In the martingale case ($p_A = p_B$)
these approaches yield some more information about the distribution of the
a.s. limit $X_\infty$ of $X_n$. In Section 5 some first elements about the
rate of convergence of the algorithm are provided that emphasize its
nonstandard behavior among stochastic approximation procedures. Furthermore,
some stopping rules are derived for the algorithm, inspired by some method of
proof for infallibility. The last section contains some provisional remarks
and additional results.

Note that, except for the notation and the elementary facts contained in
Section 2, the other sections are self-contained and can be read
independently.

Notation. (i) The letter $C$ will denote a positive real constant that
may change from line to line.

(ii) The letter $\xi$ will denote a random positive real constant that may
change from line to line.

(iii) Let $(a_n)_{n\ge 0}$ and $(b_n)_{n\ge 0}$ be two sequences of positive
real numbers. The symbol $a_n \asymp b_n$ stands for $a_n = O(b_n)$ and
$b_n = O(a_n)$, whereas the symbol $a_n \sim b_n$ means
$\lim_n a_n/b_n = 1$.
1. The main result.

Theorem 1. (a) Almost sure convergence.

(i) If $0 \le p_B < p_A \le 1$ and $x \in (0,1)$, $(X_n)_{n\ge 0}$ is a
bounded submartingale, hence $\mathbb{P}_x$-a.s. converging toward a random
variable $X_\infty$. The random variable $X_\infty$ takes values in $\{0,1\}$
and

$\mathbb{P}_x(X_\infty = 1) = x + \pi \sum_{n\ge 0} \gamma_{n+1}\,\mathbb{E}_x(h(X_n)) \ge x + \pi\gamma_1 x(1-x) > x.$

(If $p_B = 0$ and $p_A > 0$, then $X_n$ is nondecreasing and converges
toward 1.)

(ii) If $0 \le p_B = p_A \le 1$ and $x \in (0,1)$, then $(X_n)_{n\ge 0}$ is a
bounded martingale, $\mathbb{P}_x$-a.s. converging toward a random variable
$X_\infty$.

Moreover, if $\sum_{n\ge 0}\gamma_{n+1}^2 = +\infty$, $X_\infty$ is
$\{0,1\}$-valued with distribution Bernoulli($x$).
(If $p_B = p_A = 0$, then $X_n = x$, $\mathbb{P}_x$-a.s. for every $n \ge 0$.)

(iii) If $x \in \{0,1\}$, then $X_n = x$, $\mathbb{P}_x$-a.s. for every
$n \ge 0$.

(b) Convergence to 0 with positive probability. If

(3) $\sum_{n\ge 0}\prod_{k=1}^{n}(1 - p_B\gamma_k) < +\infty$

then, for every $x \in [0,1)$,

$\mathbb{P}_x(X_\infty = 0) > 0.$

In particular:

(i) if $0 < p_B < p_A \le 1$, then, for every $x \in (0,1)$, the two-armed
bandit algorithm starting from $x$ is fallible;

(ii) if $0 < p_B = p_A \le 1$ and $\sum_n \gamma_n^2 < +\infty$, then, for
every $x \in (0,1)$,

$\mathbb{P}_x(X_\infty = 0)$, $\mathbb{P}_x(X_\infty = 1)$ and $\mathbb{P}_x(X_\infty \in (0,1)) > 0.$

(c) Convergence to a nonzero value. Assume $0 < p_B \le p_A \le 1$ and

(4) $\gamma_n = O(\Gamma_n e^{-p_B\Gamma_n}).$

Then, for every $x \in (0,1]$,

$\mathbb{P}_x(X_\infty = 0) = 0.$

In particular:

(i) if $0 < p_B < p_A \le 1$ then, for every $x \in (0,1]$,

$X_\infty = 1$ $\mathbb{P}_x$-a.s., that is, the algorithm is a.s. infallible;

(ii) when $0 < p_B = p_A \le 1$ then, for every $x \in (0,1)$,

$X_\infty \in (0,1)$, $\mathbb{P}_x$-a.s.
Proof. This theorem follows from Propositions 2, 4 and 5. These
propositions can be seen as steps of the proof of the theorem. □

We will derive in Corollaries 1 and 2 how the above step assumptions
(3) and (4) for fallibility or infallibility read on some natural
parametrized families of step sequences.

But, first, we briefly clarify some connections between the different
step assumptions appearing in the statement of Theorem 1 above.

(i) $\gamma_n = O(\Gamma_n e^{-p_B\Gamma_n}) \Rightarrow \sum_n \gamma_n^2 < +\infty$:
see the remark after Proposition 5, Section 3.3.

(ii) $\sum_n \prod_{k=1}^{n}(1 - p_B\gamma_k) = +\infty \not\Rightarrow \sum_n \gamma_n^2 < +\infty$:
a counter-example is provided by

$\gamma_{2^n} = \frac{1}{\sqrt{n+1}}$, $n \ge 0$, and $\gamma_k = 0$ if $k \notin \{2^n, n \ge 0\}$.

Corollary 1 (Fallibility). Let $p_B \in (0,1]$.

(a) Constant step. If the step $\gamma_n := \gamma \in (0,1)$, the two-armed
bandit algorithm does converge toward 0 with positive probability. Namely,

$\forall x \in (0,1), \quad \mathbb{P}_x(X_\infty = 0) \ge (1-x)^{1/(p_B\gamma)} > 0.$

(b) Power step (I). One considers the family of "power" steps
$\gamma_n := \bigl(\frac{C}{C+n}\bigr)^{\alpha}$, $0 < \alpha \le 1$, $C > 0$,
$n \ge 1$. These step sequences satisfy assumption (2). If

($0 < \alpha < 1$) or ($\alpha = 1$ and $C > 1/p_B$),

then, for every $x \in [0,1)$, $\mathbb{P}_x(X_\infty = 0) > 0$ (i.e., the
algorithm is fallible from $x$).

(c) In particular, if $0 < p_B < p_A \le 1$, the two-armed bandit algorithm is
fallible starting from any $x \in [0,1)$ for the step sequences specified in
the above items (a) and (b).

Proof. (b) The above condition on $C$ and $\alpha$ implies that assumption (3)
of Theorem 1 is fulfilled.

(a) The lower bound for $\mathbb{P}_x(X_\infty = 0)$ needs further care. It
relies on (9) established in the proof of Proposition 4: setting
$\gamma_n = \gamma \in (0,1)$, it reads

$\mathbb{P}_x(X_\infty = 0) \ge \mathbb{E}_x\Bigl(\prod_{n\ge 1}\Bigl(1 - x\prod_{k=1}^{n-1}(1 - \gamma 1_{B_k})\Bigr)\Bigr).$

Then the computations can easily be carried out: the Jensen inequality yields

$\mathbb{P}_x(X_\infty = 0) \ge \exp\Bigl(\sum_{n\ge 1}\mathbb{E}_x \log(1 - xZ_n)\Bigr)$ with $Z_n := \prod_{k=1}^{n-1}(1 - \gamma 1_{B_k}).$

Now

$\mathbb{E}_x(\log(1 - xZ_n)) = -\sum_{m\ge 1}\frac{x^m}{m}\,\mathbb{E}(Z_n^m) = -\sum_{m\ge 1}\frac{x^m}{m}\bigl((1-\gamma)^m p_B + 1 - p_B\bigr)^{n-1}$

so that

$\sum_{n\ge 1}\mathbb{E}_x(\log(1 - xZ_n)) = -\sum_{m\ge 1}\frac{x^m}{m}\,\frac{1}{p_B(1 - (1-\gamma)^m)} \ge \frac{1}{p_B\gamma}\log(1 - x).$

Finally, for every $x \in [0,1]$,

(5) $\mathbb{P}_x(X_\infty = 0) \ge (1-x)^{1/(p_B\gamma)}.$ □
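
To get a feeling for the size of the bound (5), read here as $(1-x)^{1/(p_B\gamma)}$, it can be tabulated for a few constant steps. The helper name `failure_lower_bound` is an illustrative assumption; the function only evaluates the formula and does not simulate the algorithm.

```python
def failure_lower_bound(x, p_b, gamma):
    """Evaluate the lower bound (1 - x) ** (1 / (p_b * gamma)) on P_x(X_infinity = 0).

    Assumes a constant step gamma in (0, 1), p_b in (0, 1] and x in [0, 1].
    """
    if not (0.0 <= x <= 1.0 and 0.0 < p_b <= 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("need 0 <= x <= 1, 0 < p_b <= 1 and 0 < gamma < 1")
    return (1.0 - x) ** (1.0 / (p_b * gamma))


if __name__ == "__main__":
    for gamma in (0.5, 0.1, 0.01):
        print(gamma, failure_lower_bound(0.5, 0.4, gamma))
```

The bound is positive for every constant step but shrinks quickly as the step decreases, which is consistent with the failure probability vanishing as the constant step goes to 0.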

Corollary 2 (Infallibility). Let $0 < p_B \le p_A \le 1$.

(a) Power step (II). Let $\gamma_n := \bigl(\frac{C}{C+n}\bigr)^{\alpha}$,
$0 < \alpha \le 1$, $C > 0$, $n \ge 1$. Then, $\mathbb{P}_x(X_\infty = 0) = 0$
for every $x \in (0,1]$ if and only if

$\alpha = 1$ and $C \le \frac{1}{p_B}$.

(b) Power step (III). Set
$\gamma_n := \frac{\Delta_n}{1 + \Delta_1 + \cdots + \Delta_n}$, $n \ge 1$,
where $(\Delta_n)_{n\ge 1}$ is a sequence of positive real numbers satisfying
$\Delta_n \sim C n^{1/p_B - 1}\log^{\alpha} n$ for some $\alpha > 0$ and
$C > 0$. These step sequences satisfy assumption (2) since
$\gamma_n \sim \frac{1}{p_B n}$.
Then, $\mathbb{P}_x(X_\infty = 0) = 0$ for every $x \in (0,1]$ if and only if

$\alpha \le \frac{1}{p_B}$.

(c) In particular, if $0 < p_B < p_A \le 1$, the two-armed bandit algorithm is
a.s. infallible for the step sequences specified in the above items (a) and
(b), that is,

$\forall x \in (0,1], \quad \mathbb{P}_x(X_\infty = 1) = 1.$

Note for practical implementation that the step sequence
$\gamma_n = \frac{1}{n+1}$, $n \ge 1$, corresponding to $C = 1$, always
satisfies item (a) regardless of the value of $p_B$ since $1 \le 1/p_B$.
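
For the power steps with $\alpha = 1$, read here as $\gamma_n = C/(C+n)$, the partial sums of the series $\sum_{n\ge 0}\prod_{k=1}^{n}(1 - p_B\gamma_k)$ appearing in assumption (3) can be computed exactly. The sketch below uses exact rational arithmetic; the function names are illustrative assumptions. With $C p_B = 2$ the partial sums stay bounded (they telescope to $2N/(N+1)$), whereas with $C p_B = 1$ they are harmonic numbers and diverge.

```python
from fractions import Fraction


def power_step(n, c):
    """gamma_n = C / (C + n), the power step with alpha = 1, as an exact fraction."""
    return Fraction(c) / (Fraction(c) + n)


def series_partial_sum(c, p_b, n_terms):
    """Return sum_{n=0}^{n_terms-1} prod_{k=1}^{n} (1 - p_b * gamma_k) exactly."""
    total = Fraction(0)
    prod = Fraction(1)            # empty product for n = 0
    for n in range(n_terms):
        if n >= 1:
            prod *= 1 - Fraction(p_b) * power_step(n, c)
        total += prod
    return total


if __name__ == "__main__":
    for n_terms in (10, 100, 1000):
        bounded = float(series_partial_sum(2, 1, n_terms))
        growing = float(series_partial_sum(1, 1, n_terms))
        print(n_terms, bounded, growing)
```

The choice $C = 1$ therefore never satisfies the summability condition (3), whatever the value of $p_B \le 1$, in line with the practical note above.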

Proof. (a) First, assumption (2) is clearly fulfilled. Now, in view of
Corollary 1(b), we just need to prove that assumption (4) of Theorem 1 is
satisfied if $\alpha = 1$ and $Cp_B \le 1$. If $\alpha = 1$,
$\Gamma_n = C\log n + C' + o(1)$. Consequently, assumption (4) reads
$1/n = O(\log(n)\,n^{-Cp_B})$, that is, $Cp_B \le 1$.

(b) One has $S_n := 1 + \Delta_1 + \cdots + \Delta_n \sim Cp_B\,n^{1/p_B}\log^{\alpha} n$
so that $\gamma_n \sim \frac{1}{p_B n}$.
Now assume that $\alpha p_B \le 1$. We need to check assumption (4) of
Theorem 1. Notice that $S_n \ge e^{\Gamma_n}$ [see the preliminary remark
after Lemma 1 in Section 3.3 for more details, especially (15)]. Therefore,
assumption (4) reduces to $\gamma_n = O(\Gamma_n S_n^{-p_B})$, which follows
from $\Gamma_n \sim \frac{1}{p_B}\log n$ and
$S_n^{-p_B} \sim (Cp_B)^{-p_B}\,n^{-1}\log^{-\alpha p_B} n$. We now prove that
if $\alpha p_B > 1$, assumption (3) holds. We have

Further remarks on the step assumptions. (i) It follows from
Corollary 2 that there exist sequences of steps $\gamma_n$ and $\gamma'_n$
satisfying $\gamma_n \sim \gamma'_n \sim \frac{1}{p_B n}$ and such that the
corresponding algorithms $X_n$ and $X'_n$ are fallible and infallible,
respectively. In fact, the critical case for infallibility is not entirely
elucidated by the above results.

(ii) The asymptotics of the constant step setting when the step $\gamma$ goes
to 0 are elucidated in Theorem 2, Section 3.2.

2. Some elementary facts. The random innovation at time $n$ is clearly
$\varepsilon_n := (U_n, 1_{A_n}, 1_{B_n})$ (the $\varepsilon_n$'s are i.i.d.).
Set $\mathcal{F}_n := \sigma(\varepsilon_1, \ldots, \varepsilon_n)$,
$n \ge 1$, and $\mathcal{F}_0 := \{\emptyset, \Omega\}$. We denote by
$\underline{\mathcal{F}}$ the filtration $(\mathcal{F}_n)_{n\ge 0}$. It
follows from (1) that $(X_n)_{n\ge 1}$ is obviously a $[0,1]$-valued
$\underline{\mathcal{F}}$-Markov chain (homogeneous if $\gamma_n = \gamma$).
For notational convenience, we will denote by $\mathbb{P}_x$ the distribution
of the whole sequence $(X_n)_{n\ge 0}$ starting at $x \in [0,1]$. One also
derives from (1) some straightforward properties of the algorithm.

Proposition 1. For every $x \in (0,1)$ and every $n \ge 1$, $X_n \in (0,1)$.
On the other hand, both states 0 and 1 are absorbing, that is, if
$x \in \{0,1\}$, $X_n = x$ for every $n \ge 1$, $\mathbb{P}_x$-a.s.

Viewed as a stochastic approximation procedure, its canonical form reads

(6) $X_n = X_{n-1} + \gamma_n \pi h(X_{n-1}) + \gamma_n \Delta M_n,$

where

$\pi := p_A - p_B, \qquad h(x) := x(1 - x)$

and

$\Delta M_n := (1 - X_{n-1})\,1_{\{U_n \le X_{n-1}\} \cap A_n} - X_{n-1}\,1_{\{U_n > X_{n-1}\} \cap B_n} - \pi h(X_{n-1})$

is an $\underline{\mathcal{F}}$-martingale increment.
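
The decomposition (6) can be checked by enumerating the three possible outcomes of one step from a state $x$. The sketch below uses exact rational arithmetic; the function names are illustrative assumptions. It confirms that the mean increment is $\gamma\pi x(1-x)$ and that the conditional variance of the increment vanishes at both boundary points, which is the "noiseless" feature of the two equilibria.

```python
from fractions import Fraction


def one_step_outcomes(x, p_a, p_b, gamma):
    """List the (probability, next state) pairs for one step of the algorithm from x."""
    return [
        (p_a * x, x + gamma * (1 - x)),        # A checked and rewarded
        (p_b * (1 - x), x - gamma * x),        # B checked and rewarded
        (1 - p_a * x - p_b * (1 - x), x),      # no reward: nothing happens
    ]


def mean_increment(x, p_a, p_b, gamma):
    """E(X_1 - x) when starting from x."""
    return sum(p * (y - x) for p, y in one_step_outcomes(x, p_a, p_b, gamma))


def conditional_variance(x, p_a, p_b, gamma):
    """Var(X_1 - x) when starting from x."""
    m = mean_increment(x, p_a, p_b, gamma)
    second = sum(p * (y - x) ** 2 for p, y in one_step_outcomes(x, p_a, p_b, gamma))
    return second - m ** 2


if __name__ == "__main__":
    p_a, p_b, gamma = Fraction(3, 5), Fraction(1, 5), Fraction(1, 4)
    for x in (Fraction(0), Fraction(1, 3), Fraction(1)):
        print(x, mean_increment(x, p_a, p_b, gamma), conditional_variance(x, p_a, p_b, gamma))
```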
Remark. The mean algorithm associated with $(X_n)_{n\ge 1}$ is the
deterministic recursive procedure defined by

$x_{n+1} = x_n + \gamma_{n+1}\pi h(x_n), \qquad x_0 \in [0,1].$

It can be solved very easily: when $\pi = p_A - p_B > 0$ and
$x_0 \in (0,1]$, the sequence $x_n$ is $[0,1]$-valued and nondecreasing, hence
converging toward $x_\infty$. Since the series
$\sum_n \gamma_n h(x_{n-1}) < +\infty$, whereas $\sum_n \gamma_n = +\infty$,
it is obvious that $h(x_\infty) = 0$. Hence, $x_\infty = 1$ since
$x_\infty \ge x_0 > 0$. So, the mean algorithm never fails in pointing out the
best trader since it asymptotically assigns the whole fund to be managed by A
when $p_A > p_B$ (and by B when $p_B > p_A$). Unfortunately, it needs to know
a priori who is the best trader, that is, whether $p_A > p_B$ or
$p_B > p_A$.

Similarly, (6) shows that $(X_n)_{n\ge 1}$ is a bounded submartingale and one
derives (see Proposition 2) that then $\mathbb{P}_x$-a.s. $X_n$ converges
toward a $\{0,1\}$-valued random variable $X_\infty$ if $p_A \ne p_B$. But
this time, there is no straightforward argument showing that the procedure
always points out the best trader, for example, $X_\infty = 1$
$\mathbb{P}_x$-a.s. when $p_A > p_B$ and $x \in (0,1]$. The next proposition
yields some first answers about the behavior of the algorithm.

Proposition 2. (a) Submartingale case. If $0 \le p_B < p_A \le 1$ and
$x \in (0,1)$, $(X_n)_{n\ge 0}$ is a bounded
$\underline{\mathcal{F}}$-submartingale, hence $\mathbb{P}_x$-a.s. converging
toward a random variable $X_\infty$, taking values in $\{0,1\}$ and

$\mathbb{P}_x(X_n \to 1) = x + \pi\sum_{n\ge 0}\gamma_{n+1}\,\mathbb{E}_x(h(X_n)) \ge x + \pi\gamma_1 x(1-x) > x.$

(If $p_B = 0$ and $p_A > 0$, then $X_n$ is nondecreasing and converges
toward 1.)

(b) Martingale case. If $0 \le p_B = p_A \le 1$ and $x \in (0,1)$, then
$(X_n)_{n\ge 0}$ is a bounded $\underline{\mathcal{F}}$-martingale,
$\mathbb{P}_x$-a.s. converging toward a random variable $X_\infty$.
Moreover:

(i) if $\sum_{n\ge 0}\gamma_{n+1}^2 = +\infty$, then $X_\infty$ is
$\{0,1\}$-valued with Bernoulli distribution $\mathcal{B}(x)$,

(ii) if $\sum_{n\ge 0}\gamma_{n+1}^2 < +\infty$, then $X_\infty$ is
$[0,1]$-valued and satisfies $\mathbb{P}_x(X_\infty \in (0,1)) > 0$.

(If $p_B = p_A = 0$, then $X_n = x$, $\mathbb{P}_x$-a.s.)

Proof. (a) $(X_n)_{n\ge 0}$ is obviously a bounded
$\underline{\mathcal{F}}$-submartingale. Furthermore, its a.s. limit, say
$X_\infty$, satisfies

$\mathbb{E}_x(X_\infty) = \lim_n \mathbb{E}_x(X_n) = x + \pi\sum_{n\ge 0}\gamma_{n+1}\,\mathbb{E}_x h(X_n).$

Hence, $\sum_{n\ge 0}\gamma_{n+1}\mathbb{E}_x(h(X_n)) < +\infty$ since
$\pi > 0$ and consequently, $\sum_{n\ge 0}\gamma_{n+1}h(X_n) < +\infty$ a.s.,
which in turn implies $\liminf_n h(X_n) = 0$ since $h$ is nonnegative and
$\sum_{n\ge 0}\gamma_{n+1} = +\infty$. It follows that $h(X_\infty) = 0$ so
that $X_\infty \in \{0,1\}$ a.s. Finally,
$\mathbb{P}_x(X_\infty = 1) = \mathbb{E}_x(X_\infty)$.

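(b) $(X_n)_{n\ge 0}$ is obviously a bounded
$\underline{\mathcal{F}}$-martingale. When $p_A = p_B$, an elementary
computation shows that

(7) $\mathbb{E}_x(X_n(1 - X_n)) = (1 - p_A\gamma_n^2)\,\mathbb{E}_x(X_{n-1}(1 - X_{n-1})) = x(1-x)\prod_{k=1}^{n}(1 - p_A\gamma_k^2)$

so that

$\mathbb{E}_x(X_\infty(1 - X_\infty)) = x(1-x)\prod_{n\ge 1}(1 - p_A\gamma_n^2).$

The announced result follows since the infinite product converges toward a
nonzero limit iff $\sum_n \gamma_n^2 < +\infty$. □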
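
The one-step identity behind (7), read here with squared steps as $\mathbb{E}_x(X_1(1 - X_1)) = (1 - p\gamma^2)\,x(1-x)$ when $p_A = p_B = p$, can be verified by enumerating the three outcomes of a step. The sketch below uses exact rational arithmetic; the function name is an illustrative assumption.

```python
from fractions import Fraction


def expected_h_after_one_step(x, p, gamma):
    """E_x[X_1 (1 - X_1)] when p_A = p_B = p, by enumerating the three outcomes."""
    def h(y):
        return y * (1 - y)

    up = x + gamma * (1 - x)      # A checked and rewarded
    down = x - gamma * x          # B checked and rewarded
    return p * x * h(up) + p * (1 - x) * h(down) + (1 - p) * h(x)


if __name__ == "__main__":
    x, p, gamma = Fraction(2, 7), Fraction(1, 3), Fraction(1, 5)
    lhs = expected_h_after_one_step(x, p, gamma)
    rhs = (1 - p * gamma ** 2) * x * (1 - x)
    print(lhs, rhs, lhs == rhs)
```

Iterating this contraction factor over the steps gives the product in (7), so the limit keeps mass inside $(0,1)$ exactly when the squared steps are summable.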

One may specify without loss of generality the definition of the events $A_n$
and $B_n$: these two events never interact, so only the marginal distributions
of $1_{A_1}$ and $1_{B_1}$ are involved in the distribution of the whole
sequence $(X_n)_{n\ge 0}$. So, one sets

(8) $A_n := \{V_n \le p_A\}$ and $B_n := \{V_n \le p_B\},$

where $(U_n)_{n\ge 1}$, $(V_n)_{n\ge 1}$ are two independent i.i.d.
$U([0,1])$-distributed sequences.

Then, this "coupled" algorithm is pathwise monotone as a function of
$p_A$, the parameter $p_B$ being fixed. This is established in the proposition
below.

Proposition 3 (Pathwise comparison result). Let $x \in (0,1)$. Let $(X_n)$
and $(X'_n)$ denote two "coupled" two-armed bandit algorithms built from the
sequences $(U_n)$ and $(V_n)$, starting from $x \le x'$ and associated with
the parameters $(p_B, p_A)$ and $(p_B, p'_A)$, respectively, with
$p_A \le p'_A$. Then for every $n \in \mathbb{N}$,

$X_n \le X'_n.$

In particular,

$\{X'_n \to 0\} \subset \{X_n \to 0\}.$
Proof. The result follows from what happens between time 0 and 1.
One inspects the following four possible cases:

(i) On $\{U_1 \le x'\} \cap \{V_1 \le p'_A\}$, $X'_1 = x' + \gamma_1(1 - x')$
and $X_1 \le x + \gamma_1(1 - x) \le X'_1$.

(ii) On $\{U_1 \le x'\} \cap \{V_1 > p'_A\}$, $X'_1 = x'$ and
$X_1 \le x \le x'$.

(iii) On $\{U_1 > x'\} \cap \{V_1 \le p_B\}$, $X'_1 = (1 - \gamma_1)x'$ and
$X_1 = (1 - \gamma_1)x \le X'_1$.

(iv) On $\{U_1 > x'\} \cap \{V_1 > p_B\}$, $X'_1 = x' \ge x = X_1$. □
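
The pathwise comparison can be observed on a simulated pair of coupled paths driven by the same draws $(U_n)$ and $(V_n)$, as in the specification (8). The sketch below is illustrative only: the names `step` and `coupled_paths`, the parameter values and the seed are assumptions, and exact rational arithmetic is used so that the ordering is not blurred by rounding.

```python
import random
from fractions import Fraction


def step(x, u, v, p_a, p_b, g):
    """One step of the coupled algorithm driven by the draws u (check) and v (reward)."""
    if u <= x:                                   # A is checked
        return x + g * (1 - x) if v <= p_a else x
    return x - g * x if v <= p_b else x          # B is checked


def coupled_paths(x, x_prime, p_b, p_a, p_a_prime, gamma, n_steps, seed=0):
    """Run two algorithms on the same (U_n, V_n), with parameters p_a <= p_a_prime."""
    rng = random.Random(seed)
    xs, xs_prime = [x], [x_prime]
    for n in range(1, n_steps + 1):
        u, v = rng.random(), rng.random()
        g = gamma(n)
        xs.append(step(xs[-1], u, v, p_a, p_b, g))
        xs_prime.append(step(xs_prime[-1], u, v, p_a_prime, p_b, g))
    return xs, xs_prime


if __name__ == "__main__":
    xs, xs_prime = coupled_paths(
        Fraction(1, 4), Fraction(1, 3), Fraction(2, 5),
        Fraction(1, 2), Fraction(7, 10), lambda n: Fraction(1, n + 1), 200,
    )
    print(all(a <= b for a, b in zip(xs, xs_prime)))
```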

Remark. One checks that, when pa > > the trajectories of the general 
form of the algorithm are nondecreasing as a function of their starting value. 
In particular, the function $x \mapsto \mathbb{P}_x(X_\infty = 1)$ is nondecreasing.

3. When does the two-armed bandit algorithm fail? 

3.1. Quite often . . . 

Proposition 4. If $\sum_{n\ge 0}\prod_{k=1}^{n}(1 - p_B\gamma_k) < +\infty$ then,
$\mathbb{P}_x(X_n \to 0) > 0$ for every $x \in [0,1)$.

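Proof. One considers the event

$D_\infty := \Bigl\{U_n > x\prod_{k=1}^{n-1}(1 - \gamma_k 1_{B_k}) \text{ for every } n \ge 1\Bigr\}$

where $\prod_{k=1}^{0} = 1$. One checks by induction that, on $D_\infty$,
$X_n = x\prod_{k=1}^{n}(1 - \gamma_k 1_{B_k})$. The algorithm is nonincreasing
toward its limit $X_\infty$. Hence,

$\mathbb{E}_x(1_{D_\infty}X_\infty) \le \lim_n \mathbb{E}_x(1_{D_\infty}X_n) \le \lim_n \mathbb{E}_x\Bigl(x\prod_{k=1}^{n}(1 - \gamma_k 1_{B_k})\Bigr) = x\prod_{n\ge 1}(1 - p_B\gamma_n) = 0$

since $p_B > 0$, so that $X_\infty = 0$ on $D_\infty$. On the other hand,
$(U_n)_{n\ge 1}$ being i.i.d., uniformly distributed and independent of the
sequence $(B_n)_{n\ge 1}$,

$\mathbb{P}_x(D_\infty) = \mathbb{E}_x\bigl(\mathbb{P}_x(D_\infty \mid \sigma(B_n, n \ge 1))\bigr) = \mathbb{E}_x\Bigl(\prod_{n\ge 1}\Bigl(1 - x\prod_{k=1}^{n-1}(1 - 1_{B_k}\gamma_k)\Bigr)\Bigr).$

Consequently,

(9) $\mathbb{P}_x(X_\infty = 0) \ge \mathbb{E}_x\Bigl(\prod_{n\ge 1}\Bigl(1 - x\prod_{k=1}^{n-1}(1 - 1_{B_k}\gamma_k)\Bigr)\Bigr).$

Now, the events $B_k$ are independent, hence

$\mathbb{E}\Bigl(\sum_{n\ge 1}\prod_{k=1}^{n-1}(1 - 1_{B_k}\gamma_k)\Bigr) = \sum_{n\ge 1}\prod_{k=1}^{n-1}(1 - p_B\gamma_k) < +\infty$

so that $\sum_{n\ge 1}\prod_{k=1}^{n-1}(1 - 1_{B_k}\gamma_k) < +\infty$
$\mathbb{P}_x$-a.s. Consequently the infinite product
$\prod_{n\ge 1}\bigl(1 - x\prod_{k=1}^{n-1}(1 - 1_{B_k}\gamma_k)\bigr)$
converges toward a $\mathbb{P}_x$-a.s. positive random variable. Hence,
$\mathbb{P}_x(D_\infty) > 0$. □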

3.2. Especially with constant step although... The fallibility result
obtained in Corollary 1(b) for the algorithm with "slowly" decreasing step is
to be compared with the asymptotics of its behavior with constant step.
We know from Corollary 1(a) that, if $0 < p_B < p_A \le 1$ and
$\gamma_n = \gamma \in (0,1)$, then, for every $x \in (0,1)$, the algorithm
with step $\gamma$ [denoted $(X^\gamma_n)_{n\ge 0}$ in this paragraph] does
fail with positive probability: namely, it converges $\mathbb{P}_x$-a.s.
toward a $\{0,1\}$-valued random variable $X^\gamma_\infty$ satisfying
$\mathbb{P}_x(X^\gamma_\infty = 0) > 0$. The following theorem shows, however,
that the probability of failure goes to 0 as $\gamma \to 0$.

The fallibility of the algorithm with constant step and this property are
known to specialists in learning automata theory (see the discussion in
Chapter 5 in [11]), although not clearly established mathematically in full
generality.

Theorem 2. Assume that $0 < p_B < p_A \le 1$ and
$\gamma_n = \gamma \in (0,1)$. Then, for every $x \in (0,1]$,



hence

$\lim_{\gamma \to 0}\mathbb{P}_x(X^\gamma_\infty = 1) = 1.$

Proof. We focus on the function

$\psi_\gamma := \sum_{n\ge 0}(P^\gamma)^n(h),$

where $P^\gamma(x, dy)$ denotes the Markov transition probability of the
two-armed bandit algorithm with constant step $\gamma \in (0,1)$. A
straightforward computation shows that for any function
$f : [0,1] \to \mathbb{R}$,

$P^\gamma(f)(x) := \mathbb{E}_x(f(X^\gamma_1)) = p_A x f(x + \gamma(1 - x)) + p_B(1 - x) f(x(1 - \gamma)) + (1 - p_A x - p_B(1 - x)) f(x).$
At this stage, it is convenient to observe that, for every function
$g : [0,1] \to \mathbb{R}$,

$P^\gamma(gh) = h\,Q_\gamma(g),$

where the operator $Q_\gamma$ is defined by

$Q_\gamma(g)(x) = (1 - \gamma)\bigl(p_A(x + \gamma(1 - x))\,g(x + \gamma(1 - x)) + p_B(1 - x(1 - \gamma))\,g(x(1 - \gamma))\bigr) + (1 - p_A x - p_B(1 - x))\,g(x).$
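
The intertwining relation between the transition operator and $Q_\gamma$, read here as $P^\gamma(gh) = h\,Q_\gamma(g)$ with $h(x) = x(1-x)$, can be checked exactly on a polynomial test function. In the sketch below, the names `P` and `Q`, the test function and the parameter values are illustrative assumptions.

```python
from fractions import Fraction


def h(x):
    return x * (1 - x)


def P(func, x, p_a, p_b, gamma):
    """Transition operator of the constant-step algorithm applied to func at x."""
    return (p_a * x * func(x + gamma * (1 - x))
            + p_b * (1 - x) * func(x * (1 - gamma))
            + (1 - p_a * x - p_b * (1 - x)) * func(x))


def Q(func, x, p_a, p_b, gamma):
    """Operator Q_gamma applied to func at x."""
    up, down = x + gamma * (1 - x), x * (1 - gamma)
    return ((1 - gamma) * (p_a * up * func(up) + p_b * (1 - down) * func(down))
            + (1 - p_a * x - p_b * (1 - x)) * func(x))


if __name__ == "__main__":
    p_a, p_b, gamma = Fraction(3, 5), Fraction(1, 5), Fraction(1, 4)

    def g(y):
        return 1 + y + y * y

    for k in range(8):
        x = Fraction(k, 7)
        lhs = P(lambda y: g(y) * h(y), x, p_a, p_b, gamma)
        print(x, lhs == h(x) * Q(g, x, p_a, p_b, gamma))
```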

It is clear that $\psi_\gamma$ satisfies $\psi_\gamma = h\chi_\gamma$ where
$\chi_\gamma := \sum_{n\ge 0}Q_\gamma^n(1)$. One shows by successive
inductions and a little elementary calculus that $\chi_\gamma$ and
$x \mapsto -\chi'_\gamma(x)$ are absolutely decreasing functions [an
infinitely differentiable function $f$ is absolutely decreasing on $(0,1)$ if
its successive derivatives satisfy $(-1)^n f^{(n)} \ge 0$ for every
$n \ge 0$]. On one hand, one derives that $\psi_\gamma$ is infinitely
differentiable on $(0,1]$ and, on the other hand, that

(10) |X»|<— U and 0< X ' 7 (x)< 



7T7X ' 7T7X^ ' TTJX' 3 

Then, it follows from the definition of $\psi_\gamma$ that

(11) $\psi_\gamma - P^\gamma\psi_\gamma = h$ and $\mathbb{P}_x(X^\gamma_\infty = 1) = x + \pi\gamma\,\psi_\gamma(x).$

The first of these two identities reads

$p_A x\bigl(\psi_\gamma(x) - \psi_\gamma(x + \gamma(1 - x))\bigr) + p_B(1 - x)\bigl(\psi_\gamma(x) - \psi_\gamma((1 - \gamma)x)\bigr) = x(1 - x),$

that is,

$p_A x\int_0^\gamma \psi'_\gamma(x + t(1 - x))(1 - x)\,dt + p_B(1 - x)\int_0^\gamma \psi'_\gamma(x - tx)(-x)\,dt = -x(1 - x).$

Hence, simplifying by $x(1 - x)$ for every $x \in (0,1)$ yields

$p_A\int_0^\gamma \psi'_\gamma(x + t(1 - x))\,dt - p_B\int_0^\gamma \psi'_\gamma(x - tx)\,dt = -1.$

Now,

$\psi'_\gamma(x + t(1 - x)) = \psi'_\gamma(x) + \int_0^t \psi''_\gamma(x + s(1 - x))(1 - x)\,ds,$

$\psi'_\gamma(x - tx) = \psi'_\gamma(x) + \int_0^t \psi''_\gamma(x(1 - s))(-x)\,ds$

so that

$\pi\gamma\,\psi'_\gamma(x) + p_A(1 - x)\int_0^\gamma\!\!\int_0^t \psi''_\gamma(x + s(1 - x))\,ds\,dt + p_B x\int_0^\gamma\!\!\int_0^t \psi''_\gamma(x - sx)\,ds\,dt = -1.$

Combining a rewriting of this identity with an obvious inequality leads to 

-^(X) = i- (l + £ J* (p A (l - X)^{X + 8(1 ~ X)) 

+ pBXip"(x(l - s)fj ds dt 

1 p A (l-x) +p B xj 2 „ 
> Tj- sup V 7 («) • 

Plugging inequalities (10) into the equality $\psi''_\gamma = \chi''_\gamma h + 2\chi'_\gamma h' + \chi_\gamma h''$ yields

W(l) l<*fiz|> + 3iz*l + _L< 4 



7 7T7X 2 7T7X 2 7T7X 7T7X 2 

and consequently, 

if/ \ — PA -vrx 2 7 2 

-V 7 (a;) > 7j \2~~2 • 

' 7T7 7T7 7T7(1 — 7J Z X Z 

Now $\psi_\gamma(x) = \psi_\gamma(1) + \int_x^1(-\psi'_\gamma(u))\,du = \int_x^1(-\psi'_\gamma(u))\,du$ ($\psi_\gamma(1) = 0$), hence,

7T7 7T Z (1 — 7J Z \ X, 

Finally, one comes to 



3.3. But not always



Proposition 5. Assume $0 < p_B \le p_A \le 1$ and assumption (4). Then,
for every $x \in (0,1]$,

(12) $\mathbb{P}_x(X_\infty = 0) = 0.$

Remark. As soon as $p_B > 0$, assumption (4) implies
$\sum_{n\ge 1}\gamma_n^2 < +\infty$: for $n \ge 1$,

$\sum_{k=1}^{n}\gamma_k^2 \le C\sum_{k=1}^{n}\gamma_k\Gamma_k e^{-p_B\Gamma_k} \le C\int_0^{\Gamma_n}u\,e^{-p_B u}\,du \le C\int_0^{+\infty}u\,e^{-p_B u}\,du < +\infty,$

since $u \mapsto u\,e^{-p_B u}$ is nonincreasing for $u$ large enough.

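Lemma 1 (Case $p_A = p_B = 1$). Assume $p_A = p_B = 1$ and set
$\Delta_0 := 1$ and $\Delta_n := \frac{\gamma_n}{\prod_{k=1}^{n}(1 - \gamma_k)}$
for $n \ge 1$. If the sequence $(\Delta_n)_{n\ge 1}$ satisfies

(13) $\Delta_n = O(\Gamma_n),$

then, for every $x \in (0,1]$, $\mathbb{P}_x(X_\infty = 0) = 0$.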



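Preliminary Remark. With the notation of the lemma, we have

(14) $\gamma_n = \frac{\Delta_n}{S_n}$ with $S_n = \sum_{k=0}^{n}\Delta_k.$

The partial sums $S_n$ and $\Gamma_n$ satisfy, for every $n \ge 1$,

(15) $\log S_n - \sum_{k=1}^{n}\frac{\gamma_k^2}{1 - \gamma_k} \le \Gamma_n \le \log S_n.$

This follows from the easy comparisons (with $S_0 = \Delta_0 = 1$)

$\Gamma_n = \sum_{k=1}^{n}\frac{\Delta_k}{S_k} \le \int_1^{S_n}\frac{du}{u} = \log S_n$

and

$\Gamma_n = \sum_{k=1}^{n}\frac{\Delta_k}{S_k} \ge \log S_n - \sum_{k=1}^{n}\frac{\gamma_k^2}{1 - \gamma_k}.$

Hence, $S_n \to \infty$ as $n \to \infty$ since $\Gamma_n \to \infty$, and
$\gamma_n \le C\Gamma_n/S_n \le C\Gamma_n e^{-\Gamma_n}$.
Consequently, the former remark applies with $p_B = 1$ and shows that

$\sum_{n\ge 1}\gamma_n^2 < +\infty.$

Finally, assumption (13) implies

(16) $\Gamma_n \sim \log S_n$ and $S_n \asymp e^{\Gamma_n}$ as $n \to \infty.$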
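
The relations (14) and (15) can be checked numerically on a concrete step sequence. The sketch below reads the definition in Lemma 1 as $\Delta_n = \gamma_n/\prod_{k\le n}(1 - \gamma_k)$ and uses the steps $\gamma_n = 1/(n+1)$, for which $S_n = n + 1$; the function name and the choice of steps are illustrative assumptions.

```python
import math
from fractions import Fraction


def delta_and_partial_sums(gammas):
    """Return (deltas, sums) with Delta_0 = S_0 = 1 and Delta_n = gamma_n / prod_{k<=n}(1 - gamma_k)."""
    deltas, sums = [Fraction(1)], [Fraction(1)]
    prod = Fraction(1)
    for g in gammas:
        prod *= 1 - g                 # prod_{k<=n} (1 - gamma_k)
        d = g / prod
        deltas.append(d)
        sums.append(sums[-1] + d)     # S_n = S_{n-1} + Delta_n
    return deltas, sums


if __name__ == "__main__":
    gammas = [Fraction(1, n + 1) for n in range(1, 51)]
    deltas, sums = delta_and_partial_sums(gammas)
    big_gamma, correction = 0.0, 0.0
    for n, g in enumerate(gammas, start=1):
        big_gamma += float(g)
        correction += float(g * g / (1 - g))
        log_s = math.log(float(sums[n]))
        print(n, log_s - correction <= big_gamma <= log_s)
```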

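Proof of Lemma 1. The algorithm can now be rewritten as follows
(we assume $p_A = p_B = 1$):

$X_{n+1} = X_n + \gamma_{n+1}\bigl(1_{\{U_{n+1} \le X_n\}} - X_n\bigr).$

Hence,

$S_{n+1}X_{n+1} = S_{n+1}X_n + \Delta_{n+1}\bigl(1_{\{U_{n+1} \le X_n\}} - X_n\bigr),$

$S_{n+1}X_{n+1} = S_n X_n + \Delta_{n+1}1_{\{U_{n+1} \le X_n\}}.$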
Let $Y_n := S_n X_n$. We will first prove that $\lim_n Y_n = +\infty$ a.s.
Since the sequence $(\Delta_n/\Gamma_n)_{n\ge 1}$ is bounded and
$\mathbb{E}(1_{\{U_{n+1} \le X_n\}} \mid \mathcal{F}_n) = X_n$, we have (see,
e.g., Theorem 2.7.33 in [4])

$\Bigl\{\sum_{n\ge 0}\frac{\Delta_{n+1}}{\Gamma_{n+1}}1_{\{U_{n+1} \le X_n\}} = \infty\Bigr\} = \Bigl\{\sum_{n\ge 0}\frac{\Delta_{n+1}}{\Gamma_{n+1}}X_n = \infty\Bigr\}$ a.s.

But

$\Delta_{n+1}X_n \ge \Delta_{n+1}\,x\prod_{k=1}^{n}(1 - \gamma_k) \ge \Delta_{n+1}\,x\prod_{k=1}^{n+1}(1 - \gamma_k) = \gamma_{n+1}x$

so that

$\sum_{n=0}^{\infty}\frac{\Delta_{n+1}}{\Gamma_{n+1}}X_n \ge x\sum_{n=0}^{\infty}\frac{\gamma_{n+1}}{\Gamma_{n+1}} = +\infty,$

since $\Gamma_n \to \infty$ and $\gamma_n \to 0$ as $n \to \infty$.
Consequently, the nondecreasing sequence $(Y_n)_{n\ge 0}$ satisfies

$\lim_n Y_n \ge \lim_n \sum_{k=0}^{n-1}\Delta_{k+1}1_{\{U_{k+1} \le X_k\}} \ge \gamma_1\sum_{n\ge 0}\frac{\Delta_{n+1}}{\Gamma_{n+1}}1_{\{U_{n+1} \le X_n\}} = +\infty.$

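Next, we prove that $\limsup_n \frac{Y_n}{\log S_n} = +\infty$ a.s. One may
write $Y_n = x + \sum_{k=0}^{n-1}\Delta_{k+1}1_{\{U_{k+1} \le Y_k/S_k\}}$ so
that for any $\lambda > 0$:

$\limsup_n \frac{Y_n}{\log S_n} \ge \limsup_n \frac{Z^\lambda_n}{\log S_n}$ where $Z^\lambda_n = \sum_{k=0}^{n-1}\Delta_{k+1}1_{\{U_{k+1} \le \lambda/S_k\}}.$

We have

$\mathbb{E}Z^\lambda_n = \sum_{k=0}^{n-1}\Delta_{k+1}\min\Bigl(1, \frac{\lambda}{S_k}\Bigr)$

and

$\mathrm{Var}(Z^\lambda_n) = \sum_{k=0}^{n-1}\Delta^2_{k+1}\min\Bigl(1, \frac{\lambda}{S_k}\Bigr)\Bigl(1 - \min\Bigl(1, \frac{\lambda}{S_k}\Bigr)\Bigr) \le C\log S_n\sum_{k=0}^{n-1}\Delta_{k+1}\min\Bigl(1, \frac{\lambda}{S_k}\Bigr) = C\log S_n\,\mathbb{E}Z^\lambda_n.$

Consequently,

$\mathbb{P}\bigl(|Z^\lambda_n - \mathbb{E}Z^\lambda_n| \ge \rho\,\mathbb{E}Z^\lambda_n\bigr) \le C\,\frac{\log S_n}{\rho^2\,\mathbb{E}Z^\lambda_n}.$

One checks that
$\lim_n \frac{\sum_{k=0}^{n-1}\Delta_{k+1}\min(1, \lambda/S_k)}{\log S_n} = \lambda$
since $\Delta_n\min(1, \lambda/S_{n-1}) \sim \lambda\gamma_n$ and
$\Gamma_n \sim \log S_n$.

Let $A^\lambda_n = \{|Z^\lambda_n - \mathbb{E}Z^\lambda_n| \le \rho\,\mathbb{E}Z^\lambda_n\}$.
For $\lambda$ large enough, $\mathbb{P}(A^\lambda_n) > 1/2$, so that
$\mathbb{P}(\limsup_n A^\lambda_n) \ge 1/2$. Now, on the event
$\limsup_n A^\lambda_n$,

$Z^\lambda_n \ge (1 - \rho)\,\mathbb{E}Z^\lambda_n \ge \lambda(1 - \rho)\log S_n$ for infinitely many $n$.

Hence, $\mathbb{P}\bigl(\limsup_n \frac{Z^\lambda_n}{\log S_n} \ge \lambda(1 - \rho)\bigr) \ge 1/2$.
But the random variable $\limsup_n \frac{Z^\lambda_n}{\log S_n}$ lies in the
asymptotic $\sigma$-field of the i.i.d. random variables $U_n$, hence,

$\limsup_n \frac{Z^\lambda_n}{\log S_n} \ge \lambda(1 - \rho)$, $\mathbb{P}_x$-a.s.

This holds for every $\rho > 0$ and $\lambda > 0$ so that
$\limsup_n \frac{Y_n}{\log S_n} = +\infty$, $\mathbb{P}_x$-a.s.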
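On the other hand, for any positive integer $p$,

$$\mathbb{E}\bigl((X_\infty - X_p)^2 \mid \mathcal{F}_p\bigr) = \mathbb{E}\Bigl(\sum_{k\ge p}\gamma_{k+1}^2X_k(1-X_k) \Bigm| \mathcal{F}_p\Bigr).$$

Now observe that

$$\mathbb{P}_x(X_\infty = 0 \mid \mathcal{F}_p) = \frac{\mathbb{E}(\mathbf{1}_{\{X_\infty=0\}}X_p^2 \mid \mathcal{F}_p)}{X_p^2} \le \frac{\mathbb{E}((X_\infty - X_p)^2 \mid \mathcal{F}_p)}{X_p^2}$$
$$\le \frac{\sum_{k\ge p}\gamma_{k+1}^2}{X_p} = S_p\,\frac{\sum_{k\ge p}\gamma_{k+1}^2}{Y_p}$$
$$\le C\,\frac{S_p}{Y_p}\sum_{k\ge p+1}\frac{\log S_k}{S_k^2}\,\Delta_k \le C\,\frac{S_p}{Y_p}\int_{S_p}^{\infty}\frac{\log u}{u^2}\,du$$
$$\le C\,\frac{S_p}{Y_p}\,\frac{\log S_p}{S_p} = C\,\frac{\log S_p}{Y_p},$$

where the constant $C$ may change from line to line.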

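One concludes by noting that the bounded martingale $\mathbb{P}_x(X_\infty = 0 \mid \mathcal{F}_p)$ converges $\mathbb{P}_x$-a.s. to $\mathbf{1}_{\{X_\infty=0\}}$ so that

$$\mathbb{P}_x(X_\infty = 0) = \mathbb{E}\bigl(\lim_p \mathbb{P}_x(X_\infty = 0 \mid \mathcal{F}_p)\bigr) \le C\,\mathbb{E}\Bigl(\liminf_p \frac{\log S_p}{Y_p}\Bigr) = 0. \qquad \Box$$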






Lemma 2. Assume $0 < p_A = p_B \le 1$ and assumption (4). Then, for every $x \in (0,1]$,

(17) $\mathbb{P}_x(X_\infty = 0) = 0$.
Proof of Lemma 2. In that case, the algorithm can be written as follows [see the specification of the algorithm in (8)]:

$$X_{n+1} = X_n + \gamma_{n+1}\mathbf{1}_{B_{n+1}}\bigl(\mathbf{1}_{\{U_{n+1}\le X_n\}} - X_n\bigr) \quad\text{with}\quad B_{n+1} = \{V_{n+1}\le p_B\}.$$

By conditioning on the $\sigma$-field generated by the events $B_n$, $n \ge 1$, we easily deduce from Lemma 1 that if the sequences $(\Delta_n^B)_{n\in\mathbb{N}}$ and $(\gamma_n^B)_{n\in\mathbb{N}}$, defined by $\Delta_n^B = \gamma_n\mathbf{1}_{B_n}/\prod_{k=1}^{n}(1-\gamma_k\mathbf{1}_{B_k})$ and $\gamma_n^B = \gamma_n\mathbf{1}_{B_n}$, satisfy assumption (13) in Lemma 1, the announced result is proved.
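Now, for $n \ge 1$, define

$$M_n = \sum_{k=1}^{n}\bigl(\log(1-\gamma_k\mathbf{1}_{B_k}) - p_B\log(1-\gamma_k)\bigr) = \sum_{k=1}^{n}\log(1-\gamma_k)\,(\mathbf{1}_{B_k} - p_B).$$

The sequence $(M_n)$ is a martingale and $\sup_n \mathbb{E}M_n^2 < +\infty$ because $\sum_n \gamma_n^2 < \infty$ (as follows from the remark below Proposition 5). Therefore, the ratio

$$\frac{\prod_{k=1}^{n}(1-\gamma_k)^{p_B}}{\prod_{k=1}^{n}(1-\gamma_k\mathbf{1}_{B_k})}$$

is a.s. bounded.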

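Consequently, there exists a $\sigma(B_n, n\ge 1)$-measurable random positive constant $\xi$ such that, $\mathbb{P}_x$-a.s.,

$$\Delta_n^B \le \xi\,\gamma_n\prod_{k=1}^{n}(1-\gamma_k)^{-p_B}.$$

Inequality (15) and $\sum_{n\ge 1}\gamma_n^2 < +\infty$ imply that $\prod_{k=1}^{n}(1-\gamma_k)^{-1} \le Ce^{\Gamma_n}$. In turn, assumption (4) yields (up to a change of the random constant $\xi$)

$$\Delta_n^B \le \xi\,\Gamma_ne^{-p_B\Gamma_n}\bigl(e^{\Gamma_n}\bigr)^{p_B} \le \xi\,\Gamma_n.$$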

Now, a straightforward martingale argument shows that, a.s.,

$$\sum_{k=1}^{n}\gamma_k\mathbf{1}_{B_k} \sim p_B\sum_{k=1}^{n}\gamma_k,$$

so that $\Delta_n^B = O(\gamma_1^B + \cdots + \gamma_n^B)$ and the lemma follows from Lemma 1.

□ 

Proof of Proposition 5. Using Proposition 3 (pathwise comparison result), one may assume without loss of generality that $p_A = p_B > 0$. Lemma 2 completes the proof. □

4. The two-armed bandit algorithm as a generalized Polya urn. In this 
section we propose two proofs of Proposition 5 — in some special cases — in 
which the martingale case is based on methods directly inspired by the Polya 
urn. 
4.1. Short background on Polya urn. Assume that an urn contains at time 0, $r$ red balls and $b$ black balls. At every time $n$ one draws at random a ball from the urn and then puts it back in the urn with another ball of the same color. Then, at every time $n$, the urn contains (once the new ball has been put in the urn) exactly $r+b+n$ balls. Let $\beta_n$ denote the number of black balls inside the urn at time $n$, let $X_n := \frac{\beta_n}{r+b+n}$ denote the proportion of black balls at time $n$ and $Y_n := \frac{\beta_n}{r+b}$. One models the drawings using a sequence $(U_n)_{n\ge 1}$ of i.i.d. random variables uniformly distributed over $[0,1]$ as follows: if $U_{n+1} \le X_n$, the ball drawn at time $n+1$ is black, otherwise it is red. Then, these sequences satisfy, respectively,

$$\beta_0 := b \quad\text{and}\quad \beta_{n+1} = \beta_n + \mathbf{1}_{\{U_{n+1}\le X_n\}},$$
$$X_0 := \frac{b}{r+b} \quad\text{and}\quad X_{n+1} = X_n + \frac{1}{r+b+n+1}\bigl(\mathbf{1}_{\{U_{n+1}\le X_n\}} - X_n\bigr).$$

Consequently, the regular Polya urn appears as a special case of the two-armed bandit algorithm (in the martingale setting $p_A = p_B = 1$) corresponding to a rational starting value $X_0 = \frac{b}{r+b}$ and a step $\gamma_n := \frac{1}{r+b+n}$, that is, $\Delta_n = \frac{1}{r+b}$.

This suggests extending some classical methods of proof devised for the Polya urn to solve the martingale case of the two-armed bandit algorithm, in the hope of obtaining, in some cases, more accurate results, for example, concerning the distribution of the limit $X_\infty$.
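As an illustration of this correspondence, here is a minimal sketch that runs the urn (integer ball counts) and the bandit recursion with step $\gamma_n = 1/(r+b+n)$ on the same uniform draws, in exact rational arithmetic. The function name `polya_vs_bandit` and the chosen parameters are assumptions made for the example only.

```python
import random
from fractions import Fraction


def polya_vs_bandit(r, b, n_steps, seed=0):
    """Run a Polya urn and the martingale bandit recursion on the same draws.

    Returns (x, black, matched): the final value of the recursion, the final
    number of black balls, and whether X_n == beta_n / (r + b + n) held at
    every step.
    """
    rng = random.Random(seed)
    black = b                    # beta_0 = b
    x = Fraction(b, r + b)       # X_0 = b / (r + b)
    matched = True
    for n in range(n_steps):
        u = rng.random()
        hit = 1 if u <= x else 0              # black ball drawn iff U_{n+1} <= X_n
        black += hit                          # urn update
        gamma = Fraction(1, r + b + n + 1)    # gamma_{n+1} = 1 / (r + b + n + 1)
        x = x + gamma * (hit - x)             # bandit update
        matched = matched and (x == Fraction(black, r + b + n + 1))
    return x, black, matched


if __name__ == "__main__":
    x, black, matched = polya_vs_bandit(r=2, b=3, n_steps=500, seed=7)
    print("final proportion of black balls:", float(x))
    print("recursion and urn agree at every step:", matched)
```

Exact fractions are used so that the agreement between the two descriptions is an identity rather than an approximate floating-point match.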

4.2. The moment approach. Following a classical method devised to solve the Polya urn (see, e.g., [1]), it is possible to obtain some moment estimates for the limiting distribution of the $X_n$'s in the martingale case $p_A = p_B = 1$. When $\Delta_n = \Delta > 0$, this limiting distribution is even explicit.

Proposition 6. Assume that $p_A = p_B = 1$ and that the sequence $(\Delta_n)_{n\ge 1}$ is nonincreasing ($\Delta_1$ may be greater than $\Delta_0 = 1$).

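(a) For every $x \in [0,1]$ and for every integer $m \ge 1$,

(18) $$\mathbb{E}_x(X_\infty^{m+1}) \le \prod_{k=0}^{m}\Bigl(1 - \frac{1-x}{S_k}\Bigr) \quad\text{and}\quad \mathbb{E}_x\bigl((1-X_\infty)^{m+1}\bigr) \le \prod_{k=0}^{m}\Bigl(1 - \frac{x}{S_k}\Bigr).$$

In particular, for every $x \in (0,1)$,

$$\mathbb{P}_x(X_\infty = 1) \le \inf_m \mathbb{E}_x(X_\infty^m) = 0 \quad\text{and}\quad \mathbb{P}_x(X_\infty = 0) \le \inf_m \mathbb{E}_x\bigl((1-X_\infty)^m\bigr) = 0,$$

since $\sum_{k\ge 1}\frac{1}{S_k} \ge \sum_{k\ge 1}\frac{1}{1+k\Delta_1} = +\infty$.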
(b) If, moreover, $\Delta_n = \Delta > 0$, $n \ge 1$ (and $\Delta_0 = 1$), then $\gamma_n = \frac{\Delta}{n\Delta+1}$ and

$$X_\infty \overset{\mathcal{L}}{\sim} \beta\Bigl(\frac{x}{\Delta}, \frac{1-x}{\Delta}\Bigr).$$

Remark. Item (b) is a classical result about the Polya urn.

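Proof. (a) One uses the notation $Y_n = S_nX_n$ of Lemma 1 and sets

$$Z_n^{(m)} := \frac{Y_n}{S_n}\cdot\frac{Y_n+\Delta_{n+1}}{S_{n+1}}\cdots\frac{Y_n+\Delta_{n+1}+\cdots+\Delta_{n+m}}{S_{n+m}}.$$

Then

$$\mathbb{E}_x(Z_{n+1}^{(m)} \mid \mathcal{F}_n) = \mathbb{E}_x\bigl(Z_{n+1}^{(m)}\mathbf{1}_{\{U_{n+1}>X_n\}} \mid \mathcal{F}_n\bigr) + \mathbb{E}_x\bigl(Z_{n+1}^{(m)}\mathbf{1}_{\{U_{n+1}\le X_n\}} \mid \mathcal{F}_n\bigr)$$
$$= \frac{Y_n}{S_{n+1}}\cdot\frac{Y_n+\Delta_{n+2}}{S_{n+2}}\cdots\frac{Y_n+\Delta_{n+2}+\cdots+\Delta_{n+m+1}}{S_{n+m+1}}\,(1-X_n)$$
$$\quad + \frac{Y_n+\Delta_{n+1}}{S_{n+1}}\cdot\frac{Y_n+\Delta_{n+1}+\Delta_{n+2}}{S_{n+2}}\cdots\frac{Y_n+\Delta_{n+1}+\cdots+\Delta_{n+m+1}}{S_{n+m+1}}\,X_n.$$

Recall that $X_n = \frac{Y_n}{S_n}$ and $1-X_n = \frac{S_n-Y_n}{S_n}$. Hence,

$$\mathbb{E}_x(Z_{n+1}^{(m)} \mid \mathcal{F}_n) = \frac{S_n-Y_n}{S_{n+m+1}}\cdot\frac{Y_n}{S_n}\cdot\frac{Y_n+\Delta_{n+2}}{S_{n+1}}\cdots\frac{Y_n+\Delta_{n+2}+\cdots+\Delta_{n+m+1}}{S_{n+m}}$$
$$\quad + \frac{Y_n+\Delta_{n+1}+\cdots+\Delta_{n+m+1}}{S_{n+m+1}}\cdot\frac{Y_n}{S_n}\cdot\frac{Y_n+\Delta_{n+1}}{S_{n+1}}\cdots\frac{Y_n+\Delta_{n+1}+\cdots+\Delta_{n+m}}{S_{n+m}}$$
$$\le \Bigl(\frac{S_n-Y_n}{S_{n+m+1}} + \frac{Y_n+\Delta_{n+1}+\cdots+\Delta_{n+m+1}}{S_{n+m+1}}\Bigr)\frac{Y_n}{S_n}\cdot\frac{Y_n+\Delta_{n+1}}{S_{n+1}}\cdots\frac{Y_n+\Delta_{n+1}+\cdots+\Delta_{n+m}}{S_{n+m}} \quad(\text{since } \Delta_{n+i+1}\le\Delta_{n+i})$$
$$= Z_n^{(m)},$$

because $S_n + \Delta_{n+1} + \cdots + \Delta_{n+m+1} = S_{n+m+1}$.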

The sequence $(Z_n^{(m)})_{n\ge 0}$ is then a super-martingale. On the other hand, $Z_n^{(m)}$ obviously converges toward $X_\infty^{m+1}$ since, for every $k \le m$, $\frac{\Delta_{n+k}}{S_{n+k}} \le \frac{\Delta_1}{S_n} \to 0$ as $n \to +\infty$ because the sequence $(\Delta_n)_{n\ge 1}$ is nonincreasing and $S_n \uparrow +\infty$ [see (15) in the preliminary remark following Lemma 1]. Consequently, for instance, via Fatou's lemma, for every $m \in \mathbb{N}$,

$$\mathbb{E}_x(X_\infty^{m+1}) \le Z_0^{(m)} = \prod_{k=0}^{m}\frac{x+\Delta_1+\cdots+\Delta_k}{S_k} = \prod_{k=0}^{m}\Bigl(1 - \frac{1-x}{S_k}\Bigr).$$

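One proceeds symmetrically with $\tilde X_n := 1 - X_n$ and $\tilde Y_n := S_n - Y_n$ to establish the moment inequalities concerning $1 - X_\infty$.

Hence, the Lebesgue dominated convergence theorem implies that when $m \to \infty$,

$$\mathbb{P}_x(X_\infty = 0) = \lim_m \mathbb{E}_x\bigl((1-X_\infty)^{m+1}\bigr) \le \prod_{k\ge 0}\Bigl(1 - \frac{x}{S_k}\Bigr) = 0,$$

since $\sum_{n\ge 0}\frac{1}{S_n} \ge \sum_{n\ge 0}\frac{1}{1+n\Delta_1} = +\infty$.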

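(b) When $\Delta_n = \Delta$, $n \ge 1$, the same proof shows that $(Z_n^{(m)})_{n\ge 0}$ is a martingale, hence, for every $m \ge 0$,

$$\mathbb{E}_x(X_\infty^{m+1}) = \prod_{k=0}^{m}\Bigl(1 - \frac{1-x}{1+k\Delta}\Bigr) = \prod_{k=0}^{m}\frac{x/\Delta + k}{(1-x)/\Delta + x/\Delta + k}.$$

Hence, $X_\infty$ has the moments of a $\beta(x/\Delta; (1-x)/\Delta)$ distribution. Both distributions have compact support, hence, they are equal. □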
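The moment identification in item (b) can be checked numerically. The sketch below compares the product $\prod_{k=0}^{m}(x+k\Delta)/(1+k\Delta)$ with the $(m+1)$-th moment of a Beta$(a,b)$ distribution, $\Gamma(a+j)\Gamma(a+b)/(\Gamma(a)\Gamma(a+b+j))$ with $j = m+1$, for $a = x/\Delta$ and $b = (1-x)/\Delta$. The function names and the numerical values of $x$ and $\Delta$ are illustrative assumptions.

```python
import math


def product_moment(x, delta, m):
    """prod_{k=0}^{m} (x + k*delta) / (1 + k*delta), the product in item (b)."""
    p = 1.0
    for k in range(m + 1):
        p *= (x + k * delta) / (1.0 + k * delta)
    return p


def beta_moment(a, b, j):
    """j-th moment of a Beta(a, b) law, computed through log-gamma functions."""
    log_value = (math.lgamma(a + j) + math.lgamma(a + b)
                 - math.lgamma(a) - math.lgamma(a + b + j))
    return math.exp(log_value)


if __name__ == "__main__":
    x, delta = 0.3, 0.5  # hypothetical starting value and constant Delta
    for m in range(6):
        lhs = product_moment(x, delta, m)
        rhs = beta_moment(x / delta, (1.0 - x) / delta, m + 1)
        print(m + 1, lhs, rhs)
```

The two columns agree up to rounding, in line with the fact that a compactly supported distribution is determined by its moments.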

The above result can be extended to the general martingale case. 

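Corollary 3. Assume that $p_A = p_B \in (0,1]$ and that the sequence $(\Delta_n)_{n\ge 1}$ is nonincreasing ($\Delta_1$ may be greater than $\Delta_0 = 1$). Then, for every $x \in (0,1)$, $\mathbb{P}_x(X_\infty = 1) = \mathbb{P}_x(X_\infty = 0) = 0$ and, for every $m \in \mathbb{N}$,

(19) $$\mathbb{E}_x(X_\infty^{m+1}) \le \prod_{k=0}^{m}\Bigl(1 - (1-x)\prod_{\ell=1}^{k}(1-\gamma_\ell)\Bigr),$$

(20) $$\mathbb{E}_x\bigl((1-X_\infty)^{m+1}\bigr) \le \prod_{k=0}^{m}\Bigl(1 - x\prod_{\ell=1}^{k}(1-\gamma_\ell)\Bigr).$$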

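Proof. Dealing with the case $p := p_A = p_B < 1$ still needs conditioning with respect to the $\sigma$-algebra generated by the events $B_n$. However, we will proceed slightly differently from the proof of Proposition 5. We introduce the successive stopping times

$$\forall\omega\in\Omega,\quad T_0^B(\omega) := 0,\qquad T_n^B(\omega) := \min\{k > T_{n-1}^B(\omega) : \omega\in B_k\},\quad n\ge 1.$$

The $T_n^B$'s are a.s. finite iff $p_B > 0$. Then, set $\tilde\gamma_n := \gamma_{T_n^B}$, $\tilde\Delta_n := \tilde\gamma_n/\prod_{k=1}^{n}(1-\tilde\gamma_k)$ and $\tilde S_n := 1 + \tilde\Delta_1 + \cdots + \tilde\Delta_n$, $n \ge 1$, as in Lemma 1 so that $\tilde\gamma_n = \tilde\Delta_n/\tilde S_n$. One checks that

$$\frac{\tilde\Delta_{n+1}}{\tilde\Delta_n} = \frac{\Delta_{T_{n+1}^B}}{\Delta_{T_n^B}}\prod_{k=T_n^B+1}^{T_{n+1}^B-1}(1-\gamma_k) \le \frac{\Delta_{T_{n+1}^B}}{\Delta_{T_n^B}},$$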



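so that $\tilde\Delta_n$ is nonincreasing as long as $\Delta_n$ is. It follows from Proposition 6 and obvious equalities that, for every $m \ge 0$,

$$\mathbb{E}_x(X_\infty^{m+1}) \le \mathbb{E}\Bigl(\prod_{k=0}^{m}\Bigl(1 - \frac{1-x}{\tilde S_k}\Bigr)\Bigr) \quad\text{and}\quad \mathbb{E}_x\bigl((1-X_\infty)^{m+1}\bigr) \le \mathbb{E}\Bigl(\prod_{k=0}^{m}\Bigl(1 - \frac{x}{\tilde S_k}\Bigr)\Bigr).$$

Now $\tilde S_n \le 1 + n\Delta_1$, $n \ge 1$. Then, proceeding as in the proof of Proposition 6(a) yields

$$\mathbb{P}_x(X_\infty = 0) = \lim_m \mathbb{E}_x\bigl((1-X_\infty)^{m+1}\bigr) \le \prod_{k\ge 0}\Bigl(1 - \frac{x}{1+k\Delta_1}\Bigr) = 0.$$

One gets similarly that $\mathbb{P}_x(X_\infty = 1) = 0$. The moment bounds follow from the easy fact that $\tilde\Delta_n \le \Delta_n$ so that $\tilde S_n \le S_n = \bigl(\prod_{k=1}^{n}(1-\gamma_k)\bigr)^{-1}$. □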

Remark. The above bounds (19) and (20) do not involve $p_B$. In fact, they can be improved by replacing $\gamma_n$ by $\gamma_{T_n^B}$ in their right-hand side and taking the expectation with respect to $\mathbb{E}_x$. Then, one may use that the increments $T_n^B - T_{n-1}^B$ are i.i.d. with Geometric distribution $G(p_B)$.



4.3. The log-martingale approach. The log-martingale method is another classical approach to the Polya urn. It also yields a new proof of Lemma 1 when the sequence $(\Delta_n)_{n\ge 1}$ is bounded. We use the same notations as in the original lemma.

Proposition 7. Assume $p_A = p_B = 1$ and $(\Delta_n)_{n\ge 1}$ is bounded.
(a) Then, there exists a martingale $(N_n)_{n\ge 1}$ with bounded increments such that, for every $x \in (0,1)$,

$$\sup_n\Bigl|\log\Bigl(\frac{X_n}{1-X_n}\Bigr) - N_n\Bigr| < +\infty, \quad \mathbb{P}_x\text{-a.s.}$$

(b) Consequently, for every $x \in (0,1)$, $\mathbb{P}_x(X_\infty = 0) = \mathbb{P}_x(X_\infty = 1) = 0$.
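Proof. (a) Set $Z_n := \log\bigl(\frac{X_n}{1-X_n}\bigr) = \log\bigl(\frac{Y_n}{S_n-Y_n}\bigr) = \log Y_n - \log(S_n - Y_n)$. Writing $\Delta Y_k := Y_k - Y_{k-1}$, one has

$$\Bigl|\log Y_n - \Bigl(\log x + \sum_{k=1}^{n}\frac{\Delta Y_k}{Y_{k-1}}\Bigr)\Bigr| \le \sum_{k=1}^{n}\Bigl|\log\Bigl(\frac{Y_k}{Y_{k-1}}\Bigr) - \frac{\Delta Y_k}{Y_{k-1}}\Bigr| \le \frac{1}{2}\sum_{k=1}^{n}\Bigl(\frac{\Delta Y_k}{Y_{k-1}}\Bigr)^2$$
$$\le \frac{\sup_n\Delta_n}{2}\sum_{k=1}^{n}\frac{\Delta Y_k}{Y_{k-1}^2} \le \frac{c^2\sup_n\Delta_n}{2}\sum_{k\ge 1}\frac{\Delta Y_k}{Y_k^2} < +\infty,$$

where $c := 1 + \sup_n\Delta_n/x$ satisfies $Y_k/Y_{k-1} \le 1 + \Delta_k/Y_{k-1} \le c$. Similarly,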

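$$\Bigl|\log(S_n - Y_n) - \Bigl(\log(1-x) + \sum_{k=1}^{n}\frac{\Delta_k - \Delta Y_k}{S_{k-1}-Y_{k-1}}\Bigr)\Bigr| \le \frac{\tilde c^{\,2}\sup_n\Delta_n}{2}\sum_{k\ge 1}\frac{\Delta_k - \Delta Y_k}{(S_k-Y_k)^2} < +\infty,$$

with $\tilde c := 1 + \sup_n\Delta_n/(1-x)$.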

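Combining these inequalities yields

$$\sup_n\Bigl|Z_n - \sum_{k=1}^{n}\Bigl(\frac{\Delta Y_k}{Y_{k-1}} - \frac{\Delta_k - \Delta Y_k}{S_{k-1}-Y_{k-1}}\Bigr)\Bigr| < +\infty, \quad \mathbb{P}_x\text{-a.s.}$$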



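Now $N_n := \sum_{k=1}^{n}\bigl(\frac{\Delta Y_k}{Y_{k-1}} - \frac{\Delta_k - \Delta Y_k}{S_{k-1}-Y_{k-1}}\bigr)$ is a martingale since

$$\mathbb{E}_x(\Delta N_n \mid \mathcal{F}_{n-1}) = \frac{\mathbb{E}_x(\Delta Y_n \mid \mathcal{F}_{n-1})}{Y_{n-1}} - \frac{\Delta_n - \mathbb{E}_x(\Delta Y_n \mid \mathcal{F}_{n-1})}{S_{n-1}-Y_{n-1}}$$
$$= \frac{\Delta_nX_{n-1}}{Y_{n-1}} - \frac{(1-X_{n-1})\Delta_n}{S_{n-1}-Y_{n-1}} = \frac{\Delta_n}{S_{n-1}} - \frac{\Delta_n}{S_{n-1}} = 0.$$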

Furthermore, its increments are bounded. As a matter of fact, $\Delta Y_k$ and $\Delta_k - \Delta Y_k$ are never simultaneously nonzero and are upper-bounded by $\Delta_k$; on the other hand, $Y_{k-1}$ and $S_{k-1} - Y_{k-1}$ are lower bounded by $x$ and $1-x$, respectively, hence, for every $k \ge 1$,

$$|\Delta N_k| \le \frac{\Delta_k}{Y_{k-1}\wedge(S_{k-1}-Y_{k-1})} \le \frac{\Delta_k}{x\wedge(1-x)}.$$
(b) Let $(\langle N\rangle_n)_{n\ge 1}$ denote the conditional variance increment process of the martingale $(N_n)_{n\ge 1}$ and let $\langle N\rangle_\infty$ denote its limit as $n$ goes to infinity. The law of the iterated logarithm for martingales with bounded increments says that, on the event $\{\langle N\rangle_\infty = +\infty\}$, the martingale $(N_n)$ satisfies $\liminf_n N_n = -\infty$ and $\limsup_n N_n = +\infty$ a.s. Meanwhile $Z_n$ converges toward $\log\bigl(\frac{X_\infty}{1-X_\infty}\bigr) \in \overline{\mathbb{R}}$ a.s. The difference of these two quantities remaining bounded, it follows that $\langle N\rangle_\infty < +\infty$, $\mathbb{P}_x$-a.s. Hence, the martingale $N_n$ converges toward a finite limit and, consequently, $X_\infty \in (0,1)$, $\mathbb{P}_x$-a.s. □
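The mechanics of this proof can be followed on a simulated path. The sketch below, written for the martingale case with a bounded sequence $(\Delta_n)$, computes $Z_n$, $N_n$ and the increments $\Delta N_n$, so that one can observe both the bounded increments and the bounded gap between $Z_n - \log(x/(1-x))$ and $N_n$. The deterministic gap bound coded in `gap_bound` is obtained from the inequalities of part (a) together with $\sum_k \Delta Y_k/Y_k^2 \le 1/x$ (comparison with $\int_x^\infty dy/y^2$) and its symmetric counterpart; it is a derived illustration, and all function names and parameters are assumptions.

```python
import math
import random


def log_martingale_path(x0, deltas, rng):
    """Return triples (Z_n, N_n, Delta N_n) along one simulated path.

    Y_0 = x0, S_0 = 1; at each step Y increases by delta with conditional
    probability X = Y / S, and S increases by delta in any case.
    """
    y, s = x0, 1.0
    n_mart = 0.0
    out = []
    for d in deltas:
        hit = rng.random() <= y / s            # U_k <= X_{k-1}
        dy = d if hit else 0.0                 # Delta Y_k is 0 or Delta_k
        inc = dy / y - (d - dy) / (s - y)      # Delta N_k, uses Y_{k-1}, S_{k-1}
        n_mart += inc
        y += dy
        s += d
        out.append((math.log(y / (s - y)), n_mart, inc))
    return out


def gap_bound(x0, sup_delta):
    """Deterministic bound on |Z_n - log(x0/(1-x0)) - N_n| (see lead-in)."""
    c = 1.0 + sup_delta / x0
    c_tilde = 1.0 + sup_delta / (1.0 - x0)
    return 0.5 * sup_delta * (c * c / x0 + c_tilde * c_tilde / (1.0 - x0))


if __name__ == "__main__":
    x0, d = 0.3, 0.5  # hypothetical starting point and constant Delta_n
    path = log_martingale_path(x0, [d] * 5000, random.Random(3))
    z0 = math.log(x0 / (1.0 - x0))
    print("largest |Delta N_k|:", max(abs(inc) for _, _, inc in path))
    print("largest gap:", max(abs(z - z0 - n) for z, n, _ in path))
    print("deterministic gap bound:", gap_bound(x0, d))
```

The bound returned by `gap_bound` is crude; the point is only that it does not depend on $n$.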

Remark. The above assumption in Proposition 7 does not embody the Power step (III) setting of Corollary 2(b), ($p_A = p_B = 1$ and) $\Delta_n \sim C\log n$, that is, the closest case to the critical case that we can get.

The extension to general $p_A$ and $p_B$, $0 < p_B \le p_A$, in that framework consists in proving that the sequence $(\Delta_n^B)_{n\ge 1}$ is a.s. bounded. One shows using the martingale methods of Lemma 2 that this leads to the condition $\gamma_n = O(e^{-p_B\Gamma_n})$ which is, as expected, more stringent than assumption (4).

5. Rate of convergence, stopping rules. 

5.1. Rate of convergence. The aim of this section is not to elucidate 
completely the rate of convergence of the two-armed bandit algorithm but to 
draw some first conclusions from some by-products of the convergence proof. 
They emphasize that the two-armed bandit algorithm does not behave like a 
standard stochastic approximation algorithm in terms of rate of convergence. 
In particular, in some natural situations it may converge infinitely faster than its associated deterministic algorithm in average. This shows that the usual CLT for stochastic algorithms proposed in the literature (see, e.g., [5]) does not apply.

First, let us have a look at the algorithm in average,

$$x_{n+1} = x_n + \pi\gamma_{n+1}x_n(1-x_n), \quad x_0 = x \in (0,1) \quad\text{with}\quad \pi = p_A - p_B > 0.$$

One has by a straightforward induction that the sequence $(x_n)_{n\ge 0}$ is increasing and that

$$0 \le 1 - x_n = (1-x)\prod_{k=1}^{n}(1 - \pi\gamma_kx_{k-1})$$

(21) $$\le (1-x)\exp\Bigl(-\pi\sum_{1\le k\le n}\gamma_kx_{k-1}\Bigr)$$

(22) $$\le (1-x)\exp(-\pi x\Gamma_n).$$
Plugging (22) into (21) yields

$$0 \le 1 - x_n \le (1-x)\exp\Bigl(-\pi\Gamma_n + \pi(1-x)\sum_{k=1}^{n}\gamma_ke^{-\pi x\Gamma_{k-1}}\Bigr)$$
$$\le (1-x)\exp\Bigl(-\pi\Gamma_n + \pi(1-x)e^{\pi x}\int_0^{\Gamma_n}e^{-\pi xu}\,du\Bigr)$$
$$\le (1-x)\exp\Bigl(\Bigl(\frac{1}{x}-1\Bigr)e^{\pi x}\Bigr)e^{-\pi\Gamma_n}$$

(23) $$= O(e^{-\pi\Gamma_n}).$$

On the other hand, for every $n \ge 1$,

$$1 - x_n \ge (1-x)\prod_{k=1}^{n}(1 - \pi\gamma_k).$$

In particular, if one assumes that $\sum_n\gamma_n^2 < +\infty$, then there are some positive real constants $C(x)$ and $C'(x)$ such that, for every $n \ge 1$,

(24) $$C(x) \le e^{\pi\Gamma_n}(1-x_n) \le C'(x).$$
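The bounds (22) and (23) and the lower bound preceding (24) are easy to observe numerically on the algorithm in average. The following sketch iterates $x_{n+1} = x_n + \pi\gamma_{n+1}x_n(1-x_n)$ and checks the three inequalities at every step, for steps $\gamma_k \le 1$ with $\pi\gamma_k < 1$. The function names, the step sequence and the parameter values are illustrative assumptions.

```python
import math


def mean_algorithm(x0, pi, gammas):
    """Iterate x_{n+1} = x_n + pi * gamma_{n+1} * x_n * (1 - x_n).

    Returns the lists (x_0, ..., x_N) and (Gamma_0, ..., Gamma_N),
    where Gamma_n = gamma_1 + ... + gamma_n and Gamma_0 = 0.
    """
    xs, big_gamma = [x0], [0.0]
    for g in gammas:
        x = xs[-1]
        xs.append(x + pi * g * x * (1.0 - x))
        big_gamma.append(big_gamma[-1] + g)
    return xs, big_gamma


def bounds_hold(x0, pi, gammas, tol=1e-12):
    """Check (22), (23) and the product lower bound along the whole path."""
    xs, big_gamma = mean_algorithm(x0, pi, gammas)
    const_23 = (1.0 - x0) * math.exp((1.0 / x0 - 1.0) * math.exp(pi * x0))
    lower = 1.0 - x0
    for n in range(1, len(xs)):
        gap = 1.0 - xs[n]
        lower *= 1.0 - pi * gammas[n - 1]   # (1 - x) prod_{k<=n} (1 - pi gamma_k)
        ok_22 = gap <= (1.0 - x0) * math.exp(-pi * x0 * big_gamma[n]) + tol
        ok_23 = gap <= const_23 * math.exp(-pi * big_gamma[n]) + tol
        ok_low = gap >= lower - tol
        if not (ok_22 and ok_23 and ok_low):
            return False
    return True


if __name__ == "__main__":
    steps = [1.0 / (n + 1) for n in range(1, 2001)]  # hypothetical steps
    print(bounds_hold(0.2, 0.4, steps))
```

With a square-summable step sequence such as the one above, the product $e^{\pi\Gamma_n}(1-x_n)$ stays between two positive constants, as stated in (24).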

Now, let us come back to the original procedure with the specification 
given by (8). By an obvious symmetry argument, one shows, as in the proof 
of Proposition 4, that the events 

Ioo,x ■= j^n < 1 - (1 - x) JJ (1 - 7fclA fe ) for every n > 1 j 



and 



satisfy 



)iB := j X n = 1 - (1 - x) H (1 - lk l Ak ) for every n > 
I fc=i 



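Now, still following the proof of Proposition 4, $\mathbb{P}_x(I_{\infty,x}) > 0$ as soon as $\sum_n\prod_{k=1}^{n}(1 - p_A\gamma_k) < +\infty$. Moreover, if $\sum_n\gamma_n^2 < +\infty$, the proof of Lemma 2 shows that

$$\frac{\prod_{k=1}^{n}(1-\gamma_k\mathbf{1}_{A_k})}{\prod_{k=1}^{n}(1-\gamma_k)^{p_A}} \xrightarrow{\text{a.s.}} \zeta \in (0,+\infty) \quad\text{as } n\to\infty.$$

Hence,

$$\frac{\prod_{k=1}^{n}(1-\gamma_k\mathbf{1}_{A_k})}{\mathbb{E}\prod_{k=1}^{n}(1-\gamma_k\mathbf{1}_{A_k})} \xrightarrow{\text{a.s.}} \zeta' \in (0,+\infty) \quad\text{as } n\to\infty,$$

so that

(25) $$e^{p_A\Gamma_n}\prod_{k=1}^{n}(1-\gamma_k\mathbf{1}_{A_k}) \xrightarrow{\text{a.s.}} \xi \in (0,+\infty) \quad\text{as } n\to\infty.$$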

This leads to the following result concerning the rate of convergence of the 
algorithm (stated here in the infallible case, but an analogous phenomenon 
occurs in the fallible case for the equilibrium 0). 

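Proposition 8. Assume that $0 < p_B < p_A \le 1$ and that

(26) $$\gamma_n = O(\Gamma_ne^{-p_B\Gamma_n}) \quad\text{and}\quad \sum_{n\ge 1}\prod_{k=1}^{n}(1 - p_A\gamma_k) < +\infty.$$

Then, the two-armed bandit algorithm is a.s. infallible and, for every $x \in (0,1)$, there exists an event of positive $\mathbb{P}_x$-probability on which $X_n$ is nondecreasing and

(27) $$e^{p_A\Gamma_n}(1-X_n) \xrightarrow{\text{a.s.}} \xi \in (0,+\infty) \quad\text{as } n\to+\infty.$$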
Assumption (26) is fulfilled, for example, when 7„ = jj^, with < C < j^. 

Remark. (i) Comparing the rates obtained in (24) and in (27), respectively, shows that, for step sequences satisfying (26), the two-armed bandit algorithm converges toward its "target" equilibrium 1 on an event with positive probability infinitely faster than the corresponding algorithm "in average." More generally, the same phenomenon occurs at least at one of the equilibrium points as soon as

$$\sum_n\gamma_n^2 < +\infty \quad\text{and}\quad \sum_{n\ge 1}\prod_{k=1}^{n}\bigl(1 - \max(p_A,p_B)\gamma_k\bigr) < +\infty.$$

This unusual behavior in the field of stochastic approximation is confirmed by the fact that the assumptions of the standard central limit theorem for recursive stochastic algorithms (at rate $\sqrt{\gamma_n}$, see [5] among others) are never fulfilled by the two-armed bandit algorithm: when $p_A \ne p_B$ the martingale increment $\Delta M_n$ involved in the canonical decomposition (6) of the algorithm satisfies

$$\mathbb{E}_x\bigl((\Delta M_{n+1})^2 \mid \mathcal{F}_n\bigr) \le X_n(1-X_n) \to 0 \quad\text{as } n\to+\infty,$$

whereas this term is supposed to converge toward some positive real number to apply the CLT.

(ii) Proposition 4 can be slightly improved using the same ingredients as above. Namely, if $\sum_n\gamma_n^2 < +\infty$, then, for every $x \in (0,1)$,

$$\mathbb{P}_x(\{X_n \text{ goes to } 0 \text{ monotonically for large enough } n\}) > 0 \quad\text{iff}\quad \sum_{n\ge 1}\prod_{k=1}^{n}(1 - p_B\gamma_k) < +\infty$$

and

$$\mathbb{P}_x(\{X_n \text{ goes to } 1 \text{ monotonically for large enough } n\}) > 0 \quad\text{iff}\quad \sum_{n\ge 1}\prod_{k=1}^{n}(1 - p_A\gamma_k) < +\infty.$$

By symmetry, it suffices to establish the equivalence, for example, for the equilibrium 0. By the Markov property this amounts to showing that if $\sum_n\gamma_n^2 < +\infty$,

$$\mathbb{P}_x(I_{\infty,x}) > 0 \quad\text{if and only if}\quad \sum_{n\ge 1}\prod_{k=1}^{n}(1 - p_A\gamma_k) < +\infty.$$

The equivalence follows from (25) and the Lebesgue dominated convergence theorem applied to the identity

p*(w) = ej n ( 1 - n (! - ) ) . 

\n>l\ k=l J J 

5.2. Stopping rules. The proposition below derives an upper-bound for 
the conditional error probability at time n based on some inequality used in 
the proof of Lemma 1. 

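Proposition 9. Assume that $p_A, p_B \in [0,1]$ and $p_A \ne p_B$. Let $X_\infty = \text{a.s.-}\lim_n X_n$ and let $X_\infty^* = \mathbf{1}_{\{p_A>p_B\}}$ be the "target" parameter of the algorithm. Then, for every $n \ge 1$,

$$\mathbb{P}_x(X_\infty \ne X_\infty^* \mid \mathcal{F}_n) \le \max\Bigl(\min\Bigl(\frac{1-X_n}{X_n}, \frac{\sum_{k\ge n}\gamma_{k+1}^2}{X_n}\Bigr), \min\Bigl(\frac{X_n}{1-X_n}, \frac{\sum_{k\ge n}\gamma_{k+1}^2}{1-X_n}\Bigr)\Bigr).$$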

Proof. Assume for the sake of simplicity that $p_A > p_B$ so that $X_\infty^* = 1$. Assume that the events $A_n$ and $B_n$ involved in the dynamics of $(X_n)_{n\ge 0}$ are specified by (8). Then, for every $n \ge 1$, one considers $(X_k^{(n)})_{k\ge n}$, the (martingale) algorithm defined for every $k \ge n$ by

(28) $$X_n^{(n)} = X_n, \qquad X_{k+1}^{(n)} = X_k^{(n)} + \gamma_{k+1}\mathbf{1}_{B_{k+1}}\bigl(\mathbf{1}_{\{U_{k+1}\le X_k^{(n)}\}} - X_k^{(n)}\bigr).$$

It follows from Proposition 3 that, for every $n \ge 0$ and for every $k \ge n$, $X_k^{(n)} \le X_k$, so that $X_\infty^{(n)} := \text{a.s.-}\lim_k X_k^{(n)} \le X_\infty$.
Now, as in the proof of Lemma 1, one notices that

$$\mathbb{P}_x(X_\infty = 0 \mid \mathcal{F}_n) \le \mathbb{P}_x(X_\infty^{(n)} = 0 \mid \mathcal{F}_n) \le \frac{\mathbb{E}_x\bigl((X_\infty^{(n)} - X_n)^2 \mid \mathcal{F}_n\bigr)}{X_n^2}.$$

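A straightforward computation based on (28) then shows that the conditional variance increment process of $X^{(n)}$ is given for every $k \ge n$ by

$$\langle X^{(n)}\rangle_k = p_B\sum_{\ell=n}^{k-1}\gamma_{\ell+1}^2X_\ell^{(n)}(1 - X_\ell^{(n)}).$$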

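Consequently, still as in the proof of Lemma 1,

(29) $$\mathbb{P}_x(X_\infty = 0 \mid \mathcal{F}_n) \le \frac{p_B}{X_n^2}\sum_{k\ge n}\gamma_{k+1}^2\,\mathbb{E}_x\bigl(X_k^{(n)}(1-X_k^{(n)}) \mid \mathcal{F}_n\bigr)$$
$$\le \frac{p_B\sum_{k\ge n}\gamma_{k+1}^2\,\mathbb{E}_x(X_k^{(n)} \mid \mathcal{F}_n)}{X_n^2} = \frac{p_BX_n\sum_{k\ge n}\gamma_{k+1}^2}{X_n^2} \le \frac{\sum_{k\ge n}\gamma_{k+1}^2}{X_n}.$$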

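On the other hand, we know from (7) that

$$\mathbb{E}_x\bigl(X_k^{(n)}(1-X_k^{(n)}) \mid \mathcal{F}_n\bigr) = X_n(1-X_n)\prod_{\ell=n+1}^{k}(1 - p_B\gamma_\ell^2)$$

so that, by telescoping,

$$p_B\sum_{k\ge n}\gamma_{k+1}^2\,\mathbb{E}_x\bigl(X_k^{(n)}(1-X_k^{(n)}) \mid \mathcal{F}_n\bigr) = X_n(1-X_n)\Bigl(1 - \prod_{k\ge n+1}(1 - p_B\gamma_k^2)\Bigr).$$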



Plugging this identity in (29) yields

$$\mathbb{P}_x(X_\infty = 0 \mid \mathcal{F}_n) \le \frac{1-X_n}{X_n}\Bigl(1 - \prod_{k\ge n+1}(1 - p_B\gamma_k^2)\Bigr) \le \frac{1-X_n}{X_n}.$$

The upper-bound for $\mathbb{P}_x(X_\infty = 1 \mid \mathcal{F}_n)$ follows from a symmetry argument.
□ 
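To show how such a bound could serve as a stopping rule, here is a minimal sketch that evaluates the right-hand side $\max\bigl(\min(\frac{1-X_n}{X_n}, \frac{R_n}{X_n}), \min(\frac{X_n}{1-X_n}, \frac{R_n}{1-X_n})\bigr)$, where $R_n$ stands for the tail sum $\sum_{k\ge n}\gamma_{k+1}^2$, as read off from the two chains of inequalities in the proof. The function names, the choice $\gamma_k = c/k$ and the tolerance-based rule are illustrative assumptions, not prescriptions from the text. For $\gamma_k = c/k$, the tail is bounded by $c^2/n$ since $\sum_{j\ge n+1}j^{-2} \le \int_n^\infty u^{-2}\,du$.

```python
def error_bound(x_n, tail_sq):
    """Upper bound on the conditional error probability at time n.

    x_n     : current value X_n, assumed to lie strictly between 0 and 1
    tail_sq : an upper bound on sum_{k >= n} gamma_{k+1}^2
    """
    if not 0.0 < x_n < 1.0:
        raise ValueError("x_n must lie strictly between 0 and 1")
    toward_zero = min((1.0 - x_n) / x_n, tail_sq / x_n)        # bounds P(X_inf = 0 | F_n)
    toward_one = min(x_n / (1.0 - x_n), tail_sq / (1.0 - x_n))  # bounds P(X_inf = 1 | F_n)
    return max(toward_zero, toward_one)


def tail_bound_harmonic(c, n):
    """Bound on sum_{k >= n} gamma_{k+1}^2 when gamma_k = c / k (hypothetical steps)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return c * c / n


def may_stop(x_n, c, n, tolerance):
    """Hypothetical rule: stop once the bound falls below the tolerance."""
    return error_bound(x_n, tail_bound_harmonic(c, n)) <= tolerance


if __name__ == "__main__":
    print(error_bound(0.9, 0.01))
    print(may_stop(0.95, c=1.0, n=10_000, tolerance=0.01))
```

The maximum over the two branches reflects that the bound is stated without knowing which arm is the better one; if the sign of $p_A - p_B$ were known, only the relevant branch would be needed.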

6. Additional results. 
Regularity of $x \mapsto \mathbb{P}_x(X_\infty = 1)$ when $p_A > p_B$. One can obtain some regularity results for the function $x \mapsto \mathbb{P}_x(X_\infty = 1)$ as soon as $p_A \ne p_B$ (keep in mind that in that setting, $X_\infty$ is $\{0,1\}$-valued). Namely,

Proposition 10. If $p_A > p_B$, the function $x \mapsto \mathbb{P}_x(X_\infty = 1)$ is nondecreasing and analytic on $(0,1]$.

Proof. The only point to establish is analyticity. We sketch the proof 
in the case of a constant step sequence. One starts from the second equal- 
ity in (11) and the tools developed in the proof of Theorem 2. We also 
adopt the same notations. Indeed, function X7 is analytic on (0, 1] since it 
is an absolutely decreasing function. Then, tp^ is analytic as well and conse- 
quently so is x i ^ F x (X n — > 1). The extension to nonconstant step sequences 
is straightforward. □ 

About the distribution of $X_\infty$. When $0 < p_A = p_B \le 1$ and $\sum_{n\ge 1}\gamma_n^2 < +\infty$, the conditional distribution of $X_\infty$ given $\{X_\infty \ne 0, 1\}$ is continuous. This follows from Theorem 3.IV.13 in [5].

Still open questions... The main open question is, of course, to find a necessary and sufficient condition for the algorithm to be a.s. infallible. For example, when $p_A = p_B = 1$, assumption (3) is easier to express using the partial sums $S_n$ of the $\Delta_n$'s by

(30) $$\sum_{n\ge 1}\frac{1}{S_n} < +\infty.$$

If $\Delta_n = \log n\,(\log_2 n)^\beta$ (where $\log_2$ denotes the iterated logarithm $\log\log$), assumption (30) is equivalent to $\beta > 1$, whereas $\Delta_n = O(\Gamma_n)$ in Lemma 1 reads $\beta = 0$. So we are facing a log log problem.

Furthermore, it follows from the Borel-Cantelli lemma for independent events that

$$\limsup_n \frac{Y_n}{\Delta_n} \ge \limsup_n \mathbf{1}_{\{U_n\le x/S_{n-1}\}} = 1, \quad \mathbb{P}_x\text{-a.s.}$$

when $\sum_{n\ge 1}1/S_n = +\infty$.

It is to be noticed that, when $\Delta_n = \log n\,(\log_2 n)^\beta$ for some $\beta \in (0,1)$, this straightforwardly implies that $\limsup_n Y_n/\log S_n = +\infty$ (which was the key step of Lemma 1). Unfortunately, for such sequences $\Delta_n$, Lemma 1 only
implies that F X (X X = 0) = iff limsup n ^ > 1, P a -a.s. 

The last remark. Let $\sigma_n := \min\{k > \sigma_{n-1} : U_k \le X_{k-1}\}$ and $\sigma_0 := 0$ denote the increasing break times. Assumption (30) is equivalent to $\mathbb{P}_x(\sigma_1 = +\infty) > 0$ for every $x \in (0,1)$.

Otherwise, all the $\sigma_n$'s are $\mathbb{P}_x$-a.s. finite for every $x \in (0,1)$: this follows from the expression of $\mathbb{P}_x(\sigma_1 > k)$ and from the Markov property.